Abstract

We address the problem of verifying clique avoidance in the TTP protocol. TTP allows 
several stations embedded in a car to communicate. It has many mechanisms to ensure 
robustness to faults. In particular, it has an algorithm that allows a station to recognize 
itself as faulty and leave the communication. This algorithm must satisfy the crucial 
'non-clique' property: it is impossible to have two or more disjoint groups of stations 
communicating exclusively with stations in their own group. In this paper, we propose 
an automatic verification method for an arbitrary number of stations N and a given 
number of faults k. We give an abstraction that allows us to model the algorithm by means
of unbounded (parametric) counter automata. We have checked the non-clique property 
on this model in the case of one fault, using the ALV tool as well as the LASH tool. 

KEYWORDS: Formal verification, fault-tolerant protocols, parametric counter automata, 
abstraction 



1 Introduction 

The verification of complex systems, especially of software systems, requires the 
adoption of powerful methodologies based on combining, and sometimes iterat- 
ing, several analysis techniques. A widely adopted approach consists in combining 
abstraction techniques with verification algorithms (e.g., model-checking, symbolic 
reachability analysis, see, e.g., (Graf and Saidi 1997; Abdulla et al. 1999; Saidi and Shankar 1999)).
In this approach, non-trivial abstraction steps are necessary to construct faith- 
ful abstract models (typically finite-state models) on which the required proper- 
ties can be automatically verified. The abstraction steps can be extremely hard 
to carry out depending on how restricted the targeted class of abstract mod- 
els is. Indeed, many aspects in the behavior of complex software systems cannot 
(or can hardly) be captured using finite-state models. Among these aspects, we
can mention, e.g., (1) the manipulation of variables and data structures (counters,
queues, arrays, etc.) ranging over infinite domains, (2) parameterization (e.g.,
sizes of the data structures, the number of components in the system, the rates of
errors/faults/losses, etc.). For this reason, it is often necessary to consider
abstraction steps which yield infinite-state models corresponding to extended automata,
i.e., finite-control automata supplied with unbounded data structures (e.g., timed
automata, pushdown automata, counter automata, FIFO-channel automata, finite-state
transducers, etc.) (Abdulla et al. 1999). Then, symbolic reachability analysis
algorithms (see, e.g., (Cousot and Halbwachs 1978; Boigelot and Wolper 1994;
Bouajjani et al. 1997; Bouajjani and Habermehl 1997; Kesten et al. 1997; Bultan et al.;
Wolper and Boigelot; Bouajjani et al. 2000; Annichini et al. 2000; Abdulla and Jonsson 2001a;
Abdulla and Jonsson 2001b)) can be applied on these (abstract) extended automata-based
models in order to verify the desired properties of the original (concrete) system.
Of course, abstraction steps remain non-trivial in general for complex systems,
even if infinite-state extended automata are used as abstract models.

In this paper, we consider verification problems concerning a protocol used in 
the automotive and aerospace industry. The protocol, called TTP/C, was designed 
at the Technical University of Vienna in order to allow communication between 
several devices (micro-processors) embedded in a car or plane, whose function is to 
control the safe execution of different driving actions (Kopetz and Grünsteidl 1999;
Kopetz 1999).

The protocol involves many mechanisms to ensure robustness to faults. In particular,
it involves implicit and explicit mechanisms which make it possible to discard
devices (called stations) which are (supposed to be) faulty. These mechanisms must
ensure the crucial property: all active stations form one single group of communi- 
cating stations, i.e., it is impossible to have two (or more) disjoint groups of active 
stations communicating exclusively with stations in their own group. 

Actually, the algorithm is very subtle and its verification is a real challenge for 
formal and automatic verification methods. Roughly, it is a parameterized algorithm 
for N stations arranged in a ring topology. Each of the stations broadcasts a message 
to all stations when it is its turn to emit. The turn of each station is determined 
by a fixed time schedule. Stations maintain information corresponding to their
view of the global state of the system: a membership vector, consisting of an array
with a parametric size N, telling which stations are active. Stations exchange their
views of the system and this allows them to recognize faulty stations. Each time a
station sends a message, it also sends the result of a calculation which encodes its
membership vector. Stations compare their membership vectors to those received 
from sending stations. If a receiver disagrees with the membership vector of the 
sender, it counts the received message as incorrect. If a station disagrees with a 
majority of stations (in the round since the last time the station has emitted), it 
considers itself as faulty and leaves the active mode (it refrains from emitting and 
skips its turn). Stations which are inactive can return later to the active mode 
(details are given in the paper). Besides the membership vector, each station s 
maintains two integer counters in order to count in the last round (since the previous 
emission of the station s) (1) the number of stations which have emitted and from
which s has received a correct message with membership vector equal to its own
vector at that moment (the stations may disagree later concerning some other
emitting station), and (2) the number of stations from which s received an incorrect
message (the incorrect message may be due to a transmission fault or to a different
membership vector). The information maintained by each station s depends tightly
on its position in the ring relatively to the positions of the faulty stations and 
relatively to the stations which agree/disagree with s w.r.t. each fault. 

The proof of correctness of the algorithm and its automatic verification are far 
from being straightforward, especially in the parametric case, i.e., for any number 
of stations, and any number of faults. 

The first contribution of this paper is to prove that the algorithm stabilizes to a 
state where all membership vectors are equal after precisely two rounds from the 
occurrence of the last fault in any sequence of faults. The proof is given for the
general case where re-integrating stations are allowed. To guarantee stabilization after
k faults in the case of re-integration, we propose an algorithm slightly different from
the one presented in (Kopetz and Grünsteidl 1999), which guarantees stabilization
in the case of 1 fault only. The generalization to k faults makes an assumption
on the failure model (made explicit in Section 2) that may not be realistic for a
particular kind of messages called N-frames (Kopetz and Grünsteidl 1999).

Then, we address the problem of verifying automatically the algorithm. We prove 
that, for every fixed number of faults k, it is possible to construct an abstraction of 
the algorithm (parameterized by the number of stations N) by means of a paramet- 
ric counter automaton. This result is surprising since (1) it is not easy to abstract 
the information related to the topology of the system (ordering between the sta- 
tions in the ring), and (2) each station (in the concrete algorithm) has local variables 
ranging over infinite domains (two counters and an array with parametric bounds) . 
The difficulty is to prove that it is possible to encode the information needed by all 
stations by means of a finite number of counters. Basically, this is done as follows: 

(1) We observe that a sequence of faults induces a partition of the set of active 
stations (classes correspond to stations having the same membership vector) which 
is built by successive refinements: Initially, all stations are in the same set, and the 
occurrence of each fault has the effect of splitting the class containing the faulty 
station into two subclasses (stations which recognize the fault, and the other ones).

(2) We show that there is a partition of the ring into a finite number of regions (de- 
pending on the positions of the faulty stations) such that, to determine at any time 
whether a station of any class can emit, it is enough to know how many stations 
in the different classes/zones have emitted in the last two rounds. This counting is 
delicate due to the splitting of the classes after each fault. 

Finally, we show that, given a counter automaton modeling the algorithm, the 
stabilization property (after 2 rounds following the last fault) can be expressed as a 
constrained reachability property (in CTL with Presburger predicates) which can be 
checked using symbolic reachability analysis tools for counter automata (e.g., ALV 
(Bultan and Yavuz-Kahveci 2001) or LASH (LASH 2002)). We have experimented with
this approach in the case of one fault. We have built a model for the algorithm in the 
language of ALV, and we have been able to verify automatically that it converges to
a single clique after precisely two rounds from the occurrence of the fault. Actually,
we have provided a refinement of the abstraction given in the general case which
allows us to build a simpler automaton. This refinement is based on properties specific
to the 1 fault case that have been checked automatically using ALV.

Fig. 1. A TDMA round for 3 stations.

The paper is organized as follows. Section 2 presents the protocol. In Section 3
we prove the crucial non-clique property for N stations: the stations that are
still active do have the same membership vector at the end of the second round
following fault k. Considering the 1 fault case, the next section presents how to
abstract the protocol parameterized by the number of stations N as an automaton
with counters that can be symbolically model checked. The section after that
generalizes the approach for a given number of faults k, and the last section
concludes the paper. A preliminary version of this paper has appeared in
(Bouajjani and Merceron 2002).

2 Informal Description of the Protocol 

TTP is a time-triggered protocol. It has a finite set S of N stations and allows 
them to communicate via a shared bus. Messages are broadcast to all stations via 
the bus. Each station that participates in the communication sends a message when 
it is the right time to do so. Therefore, access to the bus is determined by a time 
division multiple access (TDMA) schema controlled by the global time generated by 
the protocol. A TDMA round is divided into time slots. The stations are statically 
ordered in a ring and time slots are allocated to the stations according to their order. 
During its time slot, a station has exclusive message sending rights. A TDMA round 
for three stations is shown in Figure 1. When one round is completed, the next one
takes place following the same pattern. 
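
The static slot allocation described above can be sketched in a few lines. This is only an illustration: it assumes stations are numbered 0 to N-1 in ring order and that slots are counted globally from 0; the function names `slot_owner` and `tdma_round` are hypothetical and not part of TTP.

```python
from typing import List


def slot_owner(slot: int, n_stations: int) -> int:
    """Return the index of the station that owns a global time slot.

    Assumption: slots are numbered from 0 and stations 0..N-1 follow the ring order.
    """
    return slot % n_stations


def tdma_round(round_index: int, n_stations: int) -> List[int]:
    """List the slot owners of one TDMA round, in sending order."""
    first_slot = round_index * n_stations
    return [slot_owner(first_slot + i, n_stations) for i in range(n_stations)]
```

Every round yields the same order of senders, which is the sense in which a next round "follows the same pattern".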

TTP is a fault-tolerant protocol. Stations may fail while other stations continue
communicating with each other. TTP provides different services to ensure
robustness to faults, such as replication of stations, replication of communication
channels, bus guardian, fault-tolerant clock synchronization algorithm, implicit
acknowledgment, and clique avoidance mechanism (Kopetz and Grünsteidl 1999;
Kopetz 1999; Bauer and Paulitsch 2000). Several classes of faults are distinguished.
(Steiner et al. 2004), for example, focuses on faults that may appear at startup. In
this paper, we focus on asymmetric faults. A symmetric fault occurs when a station
is send faulty, i.e., no other station can receive it properly, or receive faulty, i.e.,
it cannot receive properly any message. Asymmetric faults occur when an emitting
station is received properly by more than 1 station, but fewer than all stations. We
allow asymmetric faults to occur and consider symmetric faults as a special case of
asymmetric faults. For the protocol to work well, it is essential that (asymmetric)
faults do not give rise to cliques. In (Kopetz and Grünsteidl 1999; Kopetz 1999),
cliques are understood as disjoint sets of stations communicating exclusively with
each other. In this paper, we focus on the implicit acknowledgment and clique avoidance
mechanisms, to be introduced shortly, and show that they prevent the formation of
different cliques, where clique is understood in its graph-theoretical meaning.

2.1 Local Information 

When it is working or in the active state, a station sends messages in its time slot,
listens to messages broadcast by other stations and carries out local calculations. Each
station s stores locally some information, in particular a membership vector $m_s$
and two counters, $CAcc_s$ and $CFail_s$. A membership vector is an array of booleans
indexed by S, the set formed by the N stations. It indicates the stations that
s receives correctly (in a sense that will be made precise below). If s received
correctly the last message, also called frame, sent by s', then $m_s[s'] = 1$, otherwise
$m_s[s'] = 0$. A sending station is supposed to receive itself properly, thus $m_s[s] = 1$
for a working station s. The counters $CAcc_s$ and $CFail_s$ are used as follows. When
it is ready to send, s resets $CAcc_s$ and $CFail_s$ to 0. During the subsequent round, s
increases $CAcc_s$ by 1 each time it receives a correct frame (this includes the frame
it is sending itself) and it increases $CFail_s$ by 1 each time it receives an incorrect
frame. When no frame is sent (because the station that should send is not working),
neither $CFail_s$ nor $CAcc_s$ is increased.
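
As a rough illustration of the local information just described, the following sketch keeps a membership vector and the two counters for one station. The class and method names (`LocalState`, `start_sending`, `on_frame`, `on_silence`) are hypothetical, and whether a frame is "correct" is simply passed in as a flag here; how correctness is decided is the subject of the next subsection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalState:
    m: List[int]      # membership vector: m[i] == 1 iff station i was received correctly
    c_acc: int = 0    # CAcc: correct frames counted since the station's last slot
    c_fail: int = 0   # CFail: incorrect frames counted since the station's last slot

    def start_sending(self) -> None:
        """Reset both counters, then count the station's own frame as correct."""
        self.c_acc = 0
        self.c_fail = 0
        self.c_acc += 1

    def on_frame(self, sender: int, correct: bool) -> None:
        """Update the membership bit of the sender and the matching counter."""
        if correct:
            self.m[sender] = 1
            self.c_acc += 1
        else:
            self.m[sender] = 0
            self.c_fail += 1

    def on_silence(self) -> None:
        """No frame was sent in the slot: neither counter is increased."""
        return None
```

For instance, a station that has just sent, then hears one correct frame, one incorrect frame and one silent slot ends up with CAcc = 2 and CFail = 1.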

2.2 Implicit Acknowledgment 

Frames are broadcast over the bus to all stations but they are not explicitly
acknowledged. TTP has implicit acknowledgment. A frame is composed of a header,
denoted by h in Figure 1, a data field, denoted by data, and a CRC field, denoted
by crc. The data field contains the data, like sensor-recorded data, that a station
wants to broadcast. The CRC field contains the calculation of the Cyclic Redundancy
Check done by the sending station. The CRC is calculated over the header, the
data field and the individual membership vector. When station s is sending, it puts
in the CRC field the calculation it has done with its own membership vector $m_s$.
Station s' receiving a frame from station s recognizes the frame as valid if all the
fields have the expected lengths. If the frame is valid, station s' performs a CRC
calculation over the header and the data field it has just received, and its own
membership vector $m_{s'}$. It recognizes the frame as correct if it has recognized it as valid
and its CRC calculation agrees with the one put by s in the CRC field. Therefore,
a correct CRC implies that sender s and receiver s' have the same membership
vector.
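
The role of the membership vector in the CRC can be illustrated with a small sketch. It uses Python's `zlib.crc32` purely as a stand-in; the CRC polynomial and the frame layout actually used by TTP are not given in this text, and the names `frame_crc` and `is_correct` are hypothetical.

```python
import zlib
from typing import List


def frame_crc(header: bytes, data: bytes, membership: List[int]) -> int:
    """CRC over header, data field and a membership vector (stand-in CRC-32)."""
    return zlib.crc32(header + data + bytes(membership))


def is_correct(header: bytes, data: bytes, crc_field: int,
               receiver_membership: List[int], valid: bool = True) -> bool:
    """A frame is correct if it is valid and the receiver's own CRC matches the CRC field."""
    return valid and frame_crc(header, data, receiver_membership) == crc_field


# The sender computes the CRC field with its own membership vector.
crc_field = frame_crc(b"h", b"data", [1, 1, 1, 1])
same_view = is_correct(b"h", b"data", crc_field, [1, 1, 1, 1])
other_view = is_correct(b"h", b"data", crc_field, [0, 1, 1, 1])
```

Here `same_view` is true and `other_view` is false: the membership vector never travels in the frame, yet a receiver with a different vector computes a different CRC and rejects the frame.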

We also assume a converse: if s and s' do not have the same membership vector,
the CRC is not correct. This assumption may be strong and may not be met by
special messages of TTP called N-frames. However, it is valid for X-frames and
I-frames.

During normal operation, the sender s performs two CRC checks over the frame
received from its first successor s':

CheckIa: CRC calculation with $m_s[s] = 1$ and $m_s[s'] = 1$.

CheckIb: CRC calculation with $m_s[s] = 0$ and $m_s[s'] = 1$.

If CheckIa is correct, s knows that s and s' have identical membership
vectors, so it implicitly deduces that its successor s' has received its frame. Thus s
assumes that it is not faulty, $m_s[s]$ remains 1 and CheckIb is discarded. It increases
$CAcc_s$ by one. One says that s has reached its membership point.

If both CheckIa and CheckIb fail, it is assumed that some transient disturbance
has corrupted the frame of s'. Thus $m_s[s']$ is set to 0 and $CFail_s$ is increased by 1.
The next station becomes the first successor of s.

When CheckIa fails but CheckIb passes, either s could be send faulty or s' could
be receive faulty. According to the confidence principle, s assumes the latter, sets
the membership of s' to 0 and increases $CFail_s$ by 1. However, s performs further
similar checks over the frame received from the next successor s'' as a double check:

CheckIIa: CRC calculation with $m_s[s] = 1$ and $m_s[s'] = 0$.

CheckIIb: CRC calculation with $m_s[s] = 0$ and $m_s[s'] = 1$.

If CheckIIa passes, s is confirmed that s' is receive faulty. It increases $CAcc_s$ by 1
(for s'') and CheckIIb is discarded. Again, s has reached its membership point.

If CheckIIb passes, s assumes that it is itself send faulty and that s' is non-faulty, and
leaves the active state.

If both CheckIIa and CheckIIb fail, s considers that s'' is faulty. Therefore it sets
$m_s[s''] = 0$, it increases $CFail_s$ by 1 and will perform CheckIIa and CheckIIb again
with the next successor following the same procedure.

It is assumed that at least 3 stations are active.
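
The case analysis above can be sketched as a small function. This is a simplified reading, not the TTP implementation: a check is taken to "pass" exactly when the membership vector used by the successor equals the hypothesis vector (which relies on the converse assumption stated above), the successor's own bit is assumed to be 1 in every hypothesis, and a frame that was not received properly is represented by `None`. The names `hypothesis` and `implicit_ack` are hypothetical, and the returned counters are increments, not absolute values.

```python
from typing import Dict, List, Optional, Tuple


def hypothesis(m: List[int], overrides: Dict[int, int]) -> List[int]:
    """Copy of m with some bits forced to the values assumed by a check."""
    h = list(m)
    for index, bit in overrides.items():
        h[index] = bit
    return h


def implicit_ack(s: int, m: List[int],
                 frames: List[Tuple[int, Optional[List[int]]]]
                 ) -> Tuple[str, List[int], int, int]:
    """Run the checks of sender s over the frames of its successors, in ring order.

    frames: (successor index, membership vector behind its CRC, or None if corrupted).
    Returns (outcome, updated vector, CAcc increment, CFail increment).
    """
    m = list(m)
    c_acc = c_fail = 0
    suspect = None  # s', once the first pair of checks has failed/passed respectively
    for succ, vec in frames:
        if suspect is None:
            check_a = vec == hypothesis(m, {s: 1, succ: 1})
            check_b = vec == hypothesis(m, {s: 0, succ: 1})
            if check_a:
                return "membership point", m, c_acc + 1, c_fail
            m[succ] = 0
            c_fail += 1
            if check_b:
                suspect = succ  # confidence principle: blame s', but double check
        else:
            check_a = vec == hypothesis(m, {s: 1, suspect: 0, succ: 1})
            check_b = vec == hypothesis(m, {s: 0, suspect: 1, succ: 1})
            if check_a:
                return "membership point", m, c_acc + 1, c_fail
            if check_b:
                return "leave active state", m, c_acc, c_fail
            m[succ] = 0
            c_fail += 1
    return "still looking", m, c_acc, c_fail
```

The sketch keeps the two phases apart through the single variable `suspect`: as long as it is unset the station is still looking for a first successor, afterwards it is looking for a second one.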

2.3 Clique Avoidance Mechanism 

The clique avoidance mechanism reads as follows: Once per round, at the beginning
of its time slot, a station s checks whether $CAcc_s > CFail_s$. If it is the case, it
resets both counters as already said above and sends a message. If it is not the
case, the station fails. It sets its own membership vector bit to 0, i.e., $m_s[s] = 0$,
and leaves the active state, thus will not send in the subsequent rounds. The
intuition behind this mechanism is that a station that fails to recognize a majority
of frames as correct is most probably not working properly. Other working stations,
not receiving anything during the time slot of s, set the bit of s to 0 in their own
membership vector.

It should be noted that the implicit acknowledgment and clique avoidance mechanisms
interfere with each other, which contributes to making the analysis of the algorithm
difficult.
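
The test performed at the beginning of a time slot is small enough to be written down directly. In this sketch the function name `clique_avoidance_check` and its return convention are hypothetical; it only covers the decision itself, not the sending of the frame.

```python
from typing import List, Tuple


def clique_avoidance_check(own_index: int, m: List[int], c_acc: int, c_fail: int
                           ) -> Tuple[bool, List[int], int, int]:
    """Return (may_send, membership vector, CAcc, CFail) after the check."""
    if c_acc > c_fail:
        # Majority of correct frames: reset both counters and send.
        return True, list(m), 0, 0
    # Otherwise the station fails: it clears its own bit and leaves the active state.
    failed = list(m)
    failed[own_index] = 0
    return False, failed, c_acc, c_fail
```

Note that the comparison is strict, so a station with as many rejected frames as accepted ones (for instance CAcc = CFail = 2) is not allowed to send.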

2.4 Re-integration

Faulty stations that have left the active state can re-integrate the active state
(Kopetz 1999; Bauer and Paulitsch 2000). The re-integration algorithm that we describe
here differs slightly from the one proposed in (Kopetz and Grünsteidl 1999) to guarantee
stabilization after k faults. The algorithm proposed in (Kopetz and Grünsteidl 1999)
is enough to guarantee stabilization after 1 fault, but not after k faults. A major
difference is that re-integration, with our algorithm, may last longer, since a station,
once it has acquired a membership vector, has to listen for at least a full round before
being able to send a frame, which is not necessarily the case with the algorithm
proposed in (Kopetz and Grünsteidl 1999).

Parametric Verification of a Group Membership Algorithm

A re-integrating station s copies the membership vector from some active station.
As soon as the integrating station has a copy, it updates its membership vector,
listening to the traffic and following the same algorithm as other working stations.
During its first sending slot, it resets both counters, $CAcc_s$ and $CFail_s$, to 0, without
sending any frame. During the following round, it increases its counters and keeps
updating its membership vector as working stations do. At the beginning of its next
sending slot, s checks whether $CAcc_s > CFail_s$. If it is the case, it sets $m_s[s]$ to 1
and sends a frame, otherwise it leaves the active state again. Receiving stations,
if they detect a valid frame, set the membership of s to 1 and then perform the
CRC checks as described above.



2.5 Example 

Consider a set S composed of 4 stations and suppose that all stations received
correct frames from each other for a while. This means that they all have identical
membership vectors and CFail = 0. After station $s_3$ has sent, the membership
vectors as well as counters CAcc and CFail look as follows. Remember that there
is no global resetting of CAcc and CFail. Resetting is relative to the position of the
sending station.



stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      1      1      1      4     0
s1        1      1      1      1      3     0
s2        1      1      1      1      2     0
s3        1      1      1      1      1     0

We suppose that a fault occurs while $s_0$ is sending and that no subsequent fault
occurs for at least two rounds, calculated from the time slot of $s_0$. We assume also
that the frame sent by $s_0$ is recognized as correct by $s_2$ only. So the set S is split
in two subsets, $S_1 = \{s_0, s_2\}$ and $S_0 = \{s_1, s_3\}$.



1. Membership vectors and counters after $s_0$ has sent:

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      1      1      1      1     0
s1        0      1      1      1      3     1
s2        1      1      1      1      3     0
s3        0      1      1      1      1     1

2. Membership vectors and counters after $s_1$ has sent. At this point CheckIa
fails but CheckIb passes for $s_0$. However, CheckIa passes for $s_3$ (because of
the fault, both CheckIa and CheckIb failed in the preceding time slot for $s_3$).
Notice that $s_2$ does not have the same membership vector as $s_1$, so it does
not recognize the frame sent by $s_1$ as correct.

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      1      1      1     1
s1        0      1      1      1      1     0
s2        1      0      1      1      3     1
s3        0      1      1      1      2     1



3. Membership vectors and counters after $s_2$ has sent. Now CheckIIa passes for
$s_0$, but both CheckIa and CheckIb fail for $s_1$ because its membership vector
differs from that of $s_2$ on $s_0$. $s_3$ does not have the same membership vector
as $s_2$, so it does not recognize the frame sent by $s_2$ as correct.

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      1      1      2     1
s1        0      1      0      1      1     1
s2        1      0      1      1      1     0
s3        0      1      0      1      2     2



4. Memberships and counters after the time slot of $s_3$, which cannot send due
to the clique avoidance mechanism and leaves the active state:

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      1      0      2     1
s1        0      1      0      0      1     1
s2        1      0      1      0      1     0
s3        0      0      0      0      0     0



5. Memberships and counters after $s_0$ has sent again. At this point CheckIa
succeeds for $s_2$ while $s_1$ is still looking for a first successor.

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      1      0      1     0
s1        0      1      0      0      1     2
s2        1      0      1      0      2     0
s3        0      0      0      0      0     0




6. Memberships and counters after the time slot of $s_1$, which cannot send due
to the clique avoidance mechanism and leaves the active state:

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      1      0      1     0
s1        0      0      0      0      0     0
s2        1      0      1      0      2     0
s3        0      0      0      0      0     0





Membership vectors are coherent again at this point of time. 
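
The six steps of this example can be replayed with a deliberately simplified simulation. The sketch below is not the TTP algorithm: it assumes that a receiver accepts a frame exactly when it is not hit by the fault and its membership vector equals the sender's (the behavior that the CRC checks of Section 2.2 produce in this example), and it does not model CheckIIb or re-integration. The names `Station`, `run_slot` and `replay_example` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Station:
    m: List[int]          # membership vector
    c_acc: int = 0        # CAcc
    c_fail: int = 0       # CFail
    active: bool = True


def run_slot(stations: List[Station], sender: int,
             hit_by_fault: FrozenSet[int] = frozenset()) -> None:
    """Play the time slot of `sender`; receivers in `hit_by_fault` miss the frame."""
    s = stations[sender]
    sends = False
    if s.active:
        if s.c_acc > s.c_fail:           # clique avoidance test
            s.c_acc, s.c_fail = 1, 0     # reset, then count the station's own frame
            sends = True
        else:
            s.active = False             # the station fails and leaves the active state
            s.m[sender] = 0
    sent_vector = list(s.m)
    for i, r in enumerate(stations):
        if i == sender or not r.active:
            continue
        if not sends:
            r.m[sender] = 0              # silent slot: counters are left unchanged
        elif i not in hit_by_fault and r.m == sent_vector:
            r.c_acc += 1                 # frame recognized as correct
        else:
            r.m[sender] = 0              # frame recognized as incorrect
            r.c_fail += 1


def replay_example() -> List[Station]:
    """Start from the state after s3 has sent and replay steps 1 to 6."""
    stations = [Station([1, 1, 1, 1], c_acc=c) for c in (4, 3, 2, 1)]
    run_slot(stations, 0, frozenset({1, 3}))   # fault: only s2 receives s0 correctly
    for sender in (1, 2, 3, 0, 1):
        run_slot(stations, sender)
    return stations
```

Under these assumptions the replay ends with $s_0$ and $s_2$ as the only active stations, both holding the membership vector (1, 0, 1, 0).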



3 Proving Clique Avoidance 

In this section we prove that if k faults occur at a rate of more than 1 fault per two
TDMA rounds and if no fault occurs during the two rounds following fault k, then at
the end of that second round, all active stations have the same membership vector,
so they form a single clique in the graph-theoretical sense.

Let us denote by W the subset of S that contains all working stations. We may
write $m_s = S'$ for a station s with $S' \subseteq S$ as a shorthand for $m_s[s'] = 1$ iff $s' \in S'$.
To prove coherence of membership vectors we start with the following situation.
We suppose that stations of W have identical membership vectors and all have
$CFail_s = 0$. Because $m_s[s] = 1$ for any working station, this implies that $m_s = W$
for any $s \in W$. Faults occur from this initial state.

Let us define a graph as follows: the nodes are the stations, and there is an
arc between s and s' iff $m_s[s'] = 1$. We recall that, in graph theory, a clique is a
complete subgraph, i.e., each pair of nodes is related by an arc. Thus initially, W
forms a single clique in the graph-theoretical sense.
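
The graph-theoretical reading of a clique can be checked mechanically from the membership vectors. The sketch below assumes that the vectors are given as a list of lists indexed by station number, and that a set of stations is a clique when every ordered pair of distinct stations is related by an arc; the function name `is_clique` is hypothetical.

```python
from typing import Iterable, List


def is_clique(m: List[List[int]], nodes: Iterable[int]) -> bool:
    """True iff m[a][b] == 1 for every pair of distinct stations a, b in nodes."""
    nodes = list(nodes)
    return all(m[a][b] == 1 for a in nodes for b in nodes if a != b)


# Initially all working stations agree, so W forms a single clique.
initial = [[1, 1, 1, 1] for _ in range(4)]
# If station 1 alone clears its bit for station 0, the arc from 1 to 0 disappears.
after_fault = [[1, 1, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
```

With these two inputs, `is_clique(initial, range(4))` holds, `is_clique(after_fault, range(4))` does not, and the three remaining stations `[0, 2, 3]` still form a clique among themselves.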



3.1 Introductory Example 

Let us illustrate how things might work in the case of two faults. The first fault
occurs when $s_0$ sends. We suppose that only $s_1$ fails to receive correctly the frame
sent by $s_0$. S is split as $S_1 = \{s_0, s_2, s_3\}$ and $S_0 = \{s_1\}$. Membership vectors and
counters after $s_0$ has sent are shown below. At this point, CheckIa passes for $s_3$.

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      1      1      1      1     0
s1        0      1      1      1      3     1
s2        1      1      1      1      3     0
s3        1      1      1      1      2     0



Notice that $\{s_0, s_1, s_2, s_3\}$ do not form a clique anymore: the arc $(s_1, s_0)$ is missing.

Membership vectors and counters after $s_1$ has sent. At this point, CheckIa fails
but CheckIb passes for $s_0$. $s_2$ and $s_3$ do not have the same membership vector as
$s_1$, so they do not accept its frame as correct.

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      1      1      1     1
s1        0      1      1      1      1     0
s2        1      0      1      1      3     1
s3        1      0      1      1      2     1



Membership vectors and counters after $s_2$ has sent. At this point, we suppose
that a second fault occurs. Neither $s_3$ nor $s_0$ recognizes the frame sent by $s_2$ as
correct. $S_1$ is split into $S_{11} = \{s_2\}$ and $S_{10} = \{s_0, s_3\}$. At this point, $s_0$ keeps looking
for a second successor and both CheckIa and CheckIb fail for $s_1$.

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      0      1      1     2
s1        0      1      0      1      1     1
s2        1      0      1      1      1     0
s3        1      0      0      1      2     2

Membership vectors and counters after the time slot of $s_3$, which is prevented
from sending because of the clique avoidance mechanism:

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        1      0      0      0      1     2
s1        0      1      0      0      1     1
s2        1      0      1      0      1     0
s3        0      0      0      0      0     0



One notices that $s_0$, then $s_1$ are prevented from sending because of the clique
avoidance mechanism. Membership vectors and counters after the time slot of $s_1$:

stations  m[s0]  m[s1]  m[s2]  m[s3]  CAcc  CFail
s0        0      0      0      0      0     0
s1        0      0      0      0      0     0
s2        0      0      1      0      1     0
s3        0      0      0      0      0     0




Coherence is achieved again after the time slot of $s_1$, where $s_2$ remains the only
active station. Though $S_{11}$ is smaller than $S_{10}$, the position of $s_2$ in the ring as
the first station of the round with the second fault allows it to capitalize on frames
accepted in the round before and to win over the set $S_{10}$.



3.2 Proving a Single Clique after k Faults 

The proof proceeds as follows. First we show a preliminary result. If W is divided
into subsets $S_i$ in such a way that all stations in a subset have the same membership
vector, then stations inside a subset behave similarly: if there is no fault, they
recognize the same frames as correct or as incorrect. Frames sent by stations from
their own subset are the ones that they recognize as correct, while frames sent by
stations from other subsets are all recognized as incorrect.

Then we show that the occurrence of faults does produce such a partitioning, i.e.,
after fault k, W is divided into subsets $S_w$ where $w \in \{0,1\}^k$. Indeed, as illustrated
by the example above, after 1 fault, W is split into $S_1$, the stations that recognize the
frame as correct, and $S_0$, the stations that do not recognize the frame as correct.
Because any station recognizes itself as correct, $S_1$ is not empty. Now, suppose that
a second fault occurs. Assume that the second fault occurs when a station from set
$S_1$ sends. As before, set $S_1$ splits into $S_{11}$, the stations that recognize the frame
as correct, and $S_{10}$, the stations that do not recognize the frame as correct. Again
$S_{11}$ is not empty. Set $S_0$ becomes $S_{00}$ because stations in $S_0$ do not have the same
membership vector as stations in $S_1$. And the process generalizes. If a station s
from a set $S_w$ sends when fault k occurs, $S_w$ splits into $S_{w1}$ and $S_{w0}$ with $S_{w1} \neq \emptyset$.
Then, we show that two stations s and s' have the same membership vector if and
only if they belong to the same set $S_w$. Using the preliminary lemma, we have a
result about the incrementation of the counters CAcc and CFail, namely, all stations
from a set $S_w$ increment CAcc if a station from $S_w$ sends, and increment CFail if a
station from $S_{w'}$ sends, where $w \neq w'$. From this, we can deduce our main result:
in the second round after fault k, only stations from a single set $S_w$ can send. It
follows that, at the end of that second round, there can be at most one clique.
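
The successive refinement of the partition can be sketched as follows. The sketch labels each class with a binary word, splits the class of the sender at each fault and appends 0 to the label of every other class, as in the $S_0$ to $S_{00}$ step described above. It is only an illustration of the labeling: it ignores stations leaving the active state or re-integrating between faults, and the function name `refine` is hypothetical.

```python
from typing import Dict, Set


def refine(classes: Dict[str, Set[int]], sender: int,
           failed_receivers: Set[int]) -> Dict[str, Set[int]]:
    """Refine the partition after a fault occurring while `sender` is sending."""
    refined: Dict[str, Set[int]] = {}
    for word, members in classes.items():
        if sender in members:
            # The sender always recognizes its own frame, so the class w1 is never empty.
            accepted = {s for s in members if s == sender or s not in failed_receivers}
            rejected = members - accepted
            refined[word + "1"] = accepted
            if rejected:
                refined[word + "0"] = rejected
        else:
            # Other classes already disagree with the sender: they reject the frame.
            refined[word + "0"] = set(members)
    return refined


# Introductory example: the first fault while s0 sends, the second while s2 sends.
after_one = refine({"": {0, 1, 2, 3}}, sender=0, failed_receivers={1})
after_two = refine(after_one, sender=2, failed_receivers={0, 3})
```

On the introductory example this yields the classes $S_1 = \{s_0, s_2, s_3\}$ and $S_0 = \{s_1\}$ after the first fault, then $S_{11} = \{s_2\}$, $S_{10} = \{s_0, s_3\}$ and $S_{00} = \{s_1\}$ after the second.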

First we give a lemma that says that, if two stations have the same membership
vector, then they mutually recognize their frames as correct.

Lemma 1

Let $s, s' \in W$ with $m_s = m_{s'}$. Then $m_s[s'] = 1$ and $m_{s'}[s] = 1$.

Proof

$s, s' \in W$ means $m_s[s] = 1$ and $m_{s'}[s'] = 1$. Because s and s' have the same
membership vector, one has $m_s[s'] = 1$ and $m_{s'}[s] = 1$. □

Now we give our preliminary result when active stations are divided into subsets
$S_i$ in such a way that all stations in a subset have the same membership vector.

Proposition 2

Suppose that W is divided into m subsets $S_1, \ldots, S_m$ such that s and s' have the
same membership vector iff s and s' belong to the same subset $S_i$, $1 \le i \le m$. Let
$s \in S_i$, $1 \le i \le m$. Assume that no fault occurs. Then, each time some other station
s' is sending:

1. If $s' \in S_i$, s increases $CAcc_s$ by 1 and keeps the membership bit of s' to 1.

2. If $s' \in S_j$, $j \neq i$,

(a) either s increases $CFail_s$ by 1 and sets the membership bit of s'
to 0,

(b) or s leaves the active state.

Proof

We prove this proposition for the general case where integrating stations are allowed.
Let $s \in S_i$, or, if s is a station about to integrate, suppose it has copied its
membership vector from an active station belonging to $S_i$.

First suppose $s' \in S_i$ or, if s' is an integrating station, s' has copied its membership
vector from some active station $s'' \in S_i$. Following the integration policy, both s
and s' set the membership bit of s' to 1 when s' sends. Because s and s' have
the same membership vector and because no other fault occurs, s recognizes as
correct the frame sent by s'. So it increases $CAcc_s$ by 1 when s' sends and keeps
the membership bit of s' to 1, also if s is performing CheckIa or CheckIIa.

Suppose now $s' \in S_j$, $j \neq i$. As above, if s' is an integrating station, s' has copied its
membership vector from some active station $s'' \in S_j$. Because s and s' do not have
the same membership vector, s does not recognize the frame sent by s' as correct.
If s has reached its membership point already or is a station about to integrate, it
increases $CFail_s$ by 1 and sets the membership bit of s' to 0.

If s has not reached its membership point yet, it performs either CheckI or CheckII.
Suppose first s performs CheckIa and CheckIb, i.e., s' could be the first successor
of s. Because $s' \in S_j$, CheckIa does not pass. Hence, s increases $CFail_s$ by 1 and
sets the membership bit of s' to 0.

Suppose now s performs CheckIIa and CheckIIb, i.e., s' could be the second
successor of s. If s and s' disagree only on the bit for s, then CheckIIb passes for s,
which leaves the active state. Otherwise, s simply does not recognize as correct the
frame sent by s'; it increases $CFail_s$ by 1, sets the membership bit of s' to 0 (and
continues looking for a second successor). □

Now we prove a crucial proposition. Namely, the occurrence of faults divides the active
stations into subsets characterized by their membership vectors.

Proposition 3

At the end of the time slot of $s^k$, the station where fault k occurs, $k \ge 1$, W is
partitioned into subsets $S_w$, with $w \in \{0,1\}^k$, such that two stations $s \in S_w$ and
$s' \in S_{w'}$ have the same membership vector iff $w = w'$.

Proof 

We proceed by induction on k. 

Basis: k = 1. Let s^1 be the station which is sending when the first fault occurs.
Before the time slot of s^1, all active stations, plus s^1 if s^1 is an
integrating station, have the same membership vector, namely W, by hypothesis on
the initial state. In case s^1 is an integrating station, all stations put the
membership bit of s^1 to 1 by the integrating station policy. We denote by S_0 the
subset of W that failed to receive correctly the frame sent by s^1, while we
denote by S_1 the subset of stations that accepted the frame as correct.

All stations in S_1 could receive correctly the frame sent by s^1; consequently
their membership vectors do not change. None of the stations in S_0 could receive
correctly the frame sent by s^1 because of the fault. Hence they all put the
membership bit of s^1 to false, also in the case where s^1 was an integrating
station. Thus they all have the same membership vector, namely W \ {s^1}.

Suppose the result is true up to fault k − 1.

Induction step: we prove that the result is true for k, k > 1. Let s^k be the
station which is sending when fault k occurs. By induction hypothesis, at the end
of the time slot of s^{k−1}, W is partitioned into sets S_w with w ∈ {0,1}^{k−1}.
There is a unique w ∈ {0,1}^{k−1} with s^k ∈ S_w or, in case s^k is an integrating
station, such that the membership vector of s^k is identical to that of the
stations of S_w.

First, we show that S_w splits into S_{w0} and S_{w1}. By the earlier lemma,
m_s[s^k] = 1 for any station s ∈ S_w. Let us denote by S_{w1} the subset of S_w
that could receive s^k correctly and by S_{w0} the subset of S_w that could not
receive s^k correctly. Obviously, S_{w1} and S_{w0} partition S_w. After the time
slot of s^k, all stations in S_{w1} keep the membership bit of s^k at 1, while all
stations in S_{w0} put it to 0. Thus s, s' ∈ S_w have the same membership vector
if and only if s, s' ∈ S_{w1} or s, s' ∈ S_{w0}.

Consider some w' ∈ {0,1}^{k−1}, with w' ≠ w and S_{w'} ≠ ∅. Let s' ∈ S_{w'}. We
show that S_{w'} becomes S_{w'0} and that the condition on the membership vectors
still holds. w' ≠ w means that m_{s'} and m_{s^k} differ on some bit s*. So, s'
cannot recognize as correct the frame sent by s^k, and S_{w'} can be denoted by
S_{w'0}.

Obviously, ∀s ∈ S_{w1}: m_{s'} ≠ m_s. Could it be that now stations of S_{w'0} and
S_{w0} have the same membership vector? This could be only if their membership
vectors differed only on the bit for s^k. But this would mean that station s^k has
already emitted, that a fault occurred and that stations in S_{w'} did not accept
the frame as correct while stations in S_w did. Because all stations emit in one
round, some station u from S_{w'} has emitted. By Proposition 2, membership
vectors of stations in S_{w'} and S_w differ on u, and u ≠ s^k. Hence, stations of
S_{w'0} and S_{w0} have different membership vectors. Using a similar argument,
membership vectors of stations in S_{w''0} and S_{w'0} remain different for
w'' ∈ {0,1}^{k−1}, w' ≠ w'' ≠ w and S_{w''} ≠ ∅. □
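
The splitting step of the proof can be replayed on a small example. The sketch below is only an illustration under stated assumptions: the function name refine_partition, the encoding of w as a bit string and the choice of station identities are ours, and the sender is forced into S_{w1} (a station does not consider its own frame faulty).

```python
from typing import Dict, Set


def refine_partition(partition: Dict[str, Set[int]], sender: int,
                     accepted: Set[int]) -> Dict[str, Set[int]]:
    """Refine {w: S_w} after a fault occurring in the time slot of `sender`.

    `accepted` is the set of stations that received the frame correctly.
    Empty subsets are dropped from the result.
    """
    # Assumption: the sender always counts as having accepted its own frame.
    accepted = accepted | {sender}
    # The unique set S_w that contains the sending station.
    w = next(key for key, group in partition.items() if sender in group)
    refined: Dict[str, Set[int]] = {}
    for key, group in partition.items():
        if key == w:
            # S_w splits into S_w1 (frame accepted) and S_w0 (frame rejected).
            ok, ko = group & accepted, group - accepted
        else:
            # Stations of another set never accept the frame: S_w' becomes S_w'0.
            ok, ko = set(), set(group)
        if ok:
            refined[key + "1"] = ok
        if ko:
            refined[key + "0"] = ko
    return refined


# Fault 1: W = {1..5}, station 1 sends, stations 2 and 3 also accept the frame.
p1 = refine_partition({"": {1, 2, 3, 4, 5}}, sender=1, accepted={2, 3})
# Fault 2: station 4 (a member of S_0) sends and nobody else accepts the frame.
p2 = refine_partition(p1, sender=4, accepted=set())
```

After the first fault W is split into S_1 = {1, 2, 3} and S_0 = {4, 5}; after the second one the subsets are indexed by words of length 2, as in the proposition.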

Finally, we show that only stations from a unique set S w are able to send in the 
second round following the first fault. 

Theorem 4 

Suppose some station is able to send in the second round following fault k. Let
us denote this station by s. By Proposition 3, there exists some w ∈ {0,1}^k such
that s ∈ S_w. Then, only stations from S_w can send in the second round following
fault k.

Proof 

Consider CAcc_s and CFail_s when s performs the clique avoidance mechanism. By
Proposition 3, CAcc_s = |{s' ∈ S_w s.t. s' sent in the first round following
fault k}| and CFail_s = Σ_{w' ≠ w} |{s' ∈ S_{w'} s.t. s < s', and s' sent in the
first round following fault k}|, where < refers to the static order among
stations. Because s is able to send, one has CAcc_s > CFail_s at the beginning of
the time slot of s in the second round. Let t be the follower station of s in the
static order ready to send after s. Is t able to send, or is it prevented from
sending by the clique avoidance mechanism? We show that t is able to send if and
only if it belongs to S_w.

Suppose t ∈ S_w. When its time slot comes in the second round, it has increased
CAcc_t by 1 when s has sent in the second round. Because, in the first round
following fault k, t increases its counters as s does by Proposition 2, one has
CAcc_t = CAcc_s, or CAcc_t = CAcc_s + 1 in case s was an integrating station, and
CFail_t = CFail_s at the beginning of the time slot of t in the second round. Thus
CAcc_t > CFail_t and t is able to send as well.

Suppose now t ∈ S_{w'} for some w' ≠ w. CAcc_t = |{s' ∈ S_{w'} s.t. t < s' < s^k
and s' sent in the first round following fault k}|. Indeed, between s^k and its
present time slot, t has not accepted any frame since only s has sent. However, by
Proposition 2, CAcc_t < CFail_s since all frames accepted by t are not recognized
as correct by s. For a similar reason, CAcc_s < CFail_t. It follows that, at the
beginning of its time slot, CAcc_t < CFail_t and t is prevented from sending.

A similar argument can be repeated for all stations ready to send in the second
round, giving the result. □

From Theorem 4 one deduces that, at the end of the second round following
fault k, for any station s ∈ S_w: m_s = S_w. Together with the earlier lemma, this
gives our safety property about cliques.

Corollary 5 

At the end of the second round following fault 1, all working stations form a single 
clique in the graph theoretical sense. 

4 Automatic Verification: the 1 Fault Case 

In the case of a single fault, the set W of active stations is divided into two
subsets, S_1 and S_0. The set S_1 is not empty as it contains s^1, the station
that was sending when the fault occurred. We assume that no other fault occurs
for the next two rounds, where a round starts at the beginning of the time slot of
s^1. We want to prove automatically for an arbitrary number N of stations that, at
the end of the second round following the fault, all working stations form a
single non-empty clique. To achieve this goal, we need a formalism to model the
protocol and a formalism to specify the properties that the protocol must satisfy.
To model the protocol, we take synchronous automata extended with parameters and
counters. To specify the properties, we take the temporal logic CTL. To keep the
number of parameters as low as possible, we do not consider re-integrating
stations.

To be able to verify automatically the protocol for the parametric case where
the number of stations is a parameter N, we need to abstract the behavioural
model of the N identical extended automata into a single extended automaton.
The abstraction we use is the standard (infinite) counter abstraction and can be
automated. The novelty and difficulty of our case study, compared with other
examples using a similar abstraction technique (Pnueli et al. 2002; Delzanno
2000), lies in the fact that each individual extended automaton that models one
station has local infinite variables: two counters, CAcc and CFail, and a
membership vector m, which all depend on the parameter N. Applying counter
abstraction directly would lead to a model too coarse to be useful for
verification. Consequently, before we apply the abstraction, we perform a
transformation of the N extended automata in order to replace local variables in
guards by a finite number of global counters. The successive models we obtain are
related by a (bi)simulation property.

We divide the presentation into four main parts. First, we draw straightforward
consequences from Section 3 for the 1-fault case. Second, we give the behavioural
model of the protocol in the form of N synchronous automata with local variables.
Third, we show how we build the abstract model, replacing local variables by a
finite number of global counters, strengthening guards and performing the usual
counter abstraction in such a way that each successive model (bi)simulates the
previous one. Finally, we give the properties that have been automatically proved
on the resulting model, which consists of a single extended automaton. This
establishes the 'non-clique' property.

4.1 Properties of the 1-fault case

We will make use of the results presented in this section in the stepwise
transformations where local variables are replaced by a finite number of global
counters.

Proposition 6 

Let s be the sending station when the fault occurs. In the round following the
fault, the 3 conditions below are equivalent:

1. s leaves the active state after CheckI and CheckII,

2. the two follower stations of s have 1 everywhere in their membership vector,
except for the bit for s which is 0,

3. the two follower stations of s are in S_0.

Proof 

A simple consequence of the fact that, before the fault, all membership vectors
are equal, with 1 in each bit. □

Proposition 7 

Let s be the sending station when the fault occurs and suppose that no fault
occurs during the two subsequent rounds. Then, only station s can leave the active
state because of CheckI and CheckII in the round following the fault. In later
rounds, CheckI and CheckII do not play any role.

Proof 

If s has left because of CheckI and CheckII, there is nothing to prove. Otherwise,
the first or second successor of s belongs to S_1, so either CheckIa or CheckIIa
succeeds in this round or in later rounds. Consider now s' ≠ s. Could s' leave the
active state because of CheckI and CheckII? Consider s'', the station sending
after s'. Suppose that CheckIa fails (otherwise there is nothing to prove). This
means that s' and s'' have different membership vectors. Because no new fault
occurs, the difference in the membership vector cannot be at the bit for s' only.
Thus s' discards both CheckIa and CheckIb and keeps looking for a first
successor. □

The result below shows that we can replace the N individual counters CAcc and
CFail by two global counters d_0 and d_1. Let d_1 be a counter to count how many
stations of S_1 have sent so far in the round since the fault occurred. Let d_0 be
a similar counter for S_0. Let s be a station ready to send. Assume s ∈ S_1. How
much is CFail_s? It is exactly given by d_0. How much is CAcc_s? Generally, it is
more than d_1. One has to add all stations that have emitted before the fault
since the last time slot of s, because s has recognized them all as correct, see
Figure 2. However, this number can be calculated exactly with the help of d_1 and
d_0 only, as Theorem 8 shows.

[Figure 2: time line of the frames seen by s. Before the fault: all accepted.
After the fault: d_1 accepted, d_0 not accepted.]

Fig. 2. Illustrating proof of Theorem 8

Theorem 8 

Let s be a station ready to send and suppose there is only 1 fault.
In the round following fault 1:

1. If s ∈ S_1, then CAcc_s = |W| − d_0 and CFail_s = d_0.

2. If s ∈ S_0, then CAcc_s = |W| − d_1 and CFail_s = d_1.

In later rounds:

1. If s ∈ S_1, then CAcc_s = |S_1| and CFail_s = |S_0|.

2. If s ∈ S_0, then CAcc_s = |S_0| and CFail_s = |S_1|.

Proof 

First round after the fault. We prove the first item only, the second one being
dual. Since there was no fault in the last round, s has recognized as correct all
stations between itself and s^1 in the last round. By Proposition 2, it has
recognized as correct all stations of S_1 that have sent so far in the round. This
is illustrated in Figure 2. Thus CAcc_s = |{s' | s ≤ s' < s^1} ∩ W| + d_1, where
≤ and < refer to the static order among stations, and
{s' | s ≤ s' < s^1} ∩ W = W \ ({s' | s^1 ≤ s' < s} ∩ W).
Thus |{s' | s ≤ s' < s^1} ∩ W| = |W| − d_1 − d_0. This gives CAcc_s = |W| − d_0.
By Proposition 2, CFail_s = d_0.

Second and later rounds after the fault. Let s ∈ S_1 be the station whose time
slot comes first in the second round. This station is the faulty station itself.
By Proposition 2, CAcc_s is the number of stations of S_1 that could send in the
previous round, i.e., that are still active, which is precisely |S_1|. Similarly,
CFail_s = |S_0|. If s can send, |S_1| remains unchanged; if s cannot send, |S_1|
diminishes by 1. At the beginning of the following time slot, |S_1| is exactly the
number of stations from S_1 that have sent in the round preceding this time slot,
and similarly for S_0. Let us call s' the station corresponding to that time slot.
If s' ∈ S_1, using Proposition 2, CAcc_{s'} = |S_1| and CFail_{s'} = |S_0|. If
s' ∈ S_0, the dual is true. If s' ∉ S_0 ∪ S_1, s' has not sent in the preceding
round. In any case, at the end of the time slot of s', |S_1| is exactly the number
of stations from S_1 that have sent in the round, and similarly for S_0. Using
Proposition 2 and repeating the same argument for all time slots that follow gives
the result. □
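
The first-round part of the theorem can be checked by brute force on small networks. The sketch below is not the authors' verification method, only a sanity check under simplifying assumptions that we state explicitly: stations are numbered 0..n−1, station 0 is the sender in the faulty slot, all stations are active before the fault, CheckI and CheckII are ignored, and a station is taken to fail whenever its accept counter is not strictly greater than its fail counter. The function name check_first_round is ours.

```python
from itertools import product


def check_first_round(groups):
    """groups[i] is 1 if station i ends up in S_1 and 0 if it ends up in S_0.

    Station 0 is the faulty sender, so groups[0] must be 1. Returns True when,
    at its own slot in the round following the fault, every station has
    (CAcc, CFail) equal to the values given by the theorem.
    """
    n = len(groups)
    # Fault-free history: station i counted itself and every later sender.
    acc = [n - i for i in range(n)]
    fail = [0] * n
    active = [True] * n
    # Faulty slot of station 0: S_1 accepts the frame, S_0 rejects it.
    acc[0], fail[0] = 1, 0
    for i in range(1, n):
        if groups[i] == 1:
            acc[i] += 1
        else:
            fail[i] += 1
    d = [0, 1]  # d[g]: stations of S_g that have sent since the fault
    for t in range(1, n):
        g = groups[t]
        w = sum(active)  # |W|, the number of stations still active
        if (acc[t], fail[t]) != (w - d[1 - g], d[1 - g]):
            return False
        if acc[t] > fail[t]:
            # Clique avoidance passes: t sends and resets its counters.
            acc[t], fail[t] = 1, 0
            d[g] += 1
            for i in range(n):
                if i != t and active[i]:
                    if groups[i] == g:
                        acc[i] += 1
                    else:
                        fail[i] += 1
        else:
            # Clique avoidance fails: t leaves the set of active stations.
            active[t] = False
    return True


# Exhaustive check for every split of 2 to 7 stations into S_0 and S_1.
all_ok = all(check_first_round((1,) + rest)
             for n in range(2, 8)
             for rest in product((0, 1), repeat=n - 1))
```

Here |W| is recomputed at every slot, which matches the use of C_0 + C_1 in the guards of the abstract models below.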



4.2 Behavioural model

We give in this section a formal description of the TTP membership algorithm. For
that, we consider a general specification formalism for parametrized networks. We
assume that systems can have an arbitrary number N of components which may
share global variables and can also have local variables. Moreover, these
components can communicate by broadcasting messages. To describe the behaviour of
each of the components, we adopt an extended automata-based formalism (described
using the notations of (Manna and Pnueli 1995)) where each transition between
control states is a guarded command which may involve (broadcast) communications.
We assume that the executions of all the parallel components are synchronous,
following the semantical model of (Benveniste and Berry 1991) (i.e., all
components have the same speed w.r.t. a global logical clock defining a notion of
execution step in the system, and at each step, all the operations, including
broadcast communications, are instantaneous).

The formal semantics of such models can be defined in terms of a transition 
system. A state defines a global configuration of the network corresponding to the 
values of the global variables together with the values of all local variables (including 
the control states) for each of the N components of the network. Transitions between 
states are defined straightforwardly according to the semantical model mentioned 
above. The behaviours of the network are defined as the possible execution paths 
in the so defined transition system. 

Before giving the formal description of the TTP, let us introduce some notations.
We denote the sending of a broadcast message a by a!! and the receiving of a
broadcast message a by a??. Furthermore, [] denotes non-deterministic choice, t⊕⊕
stands for t := t ⊕ 1, i.e., the value of t in the next state is incremented by 1
modulo N, A++ stands for A := A + 1, and ';' denotes sequentiality. Assignments
separated by ',' on the right side of → could happen in any order. Variables not
mentioned in transitions remain unchanged.
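
As a small illustration of the slot increment, the sketch below assumes that time slots are numbered 1..N like the processes P[i], so that the slot following N is 1. The name next_slot is ours.

```python
def next_slot(t: int, n: int) -> int:
    """Slot that follows slot t when slots are numbered 1..n (t ⊕ 1)."""
    return t % n + 1


# One full round for N = 4, starting from the initial value t = 1.
slots = [1]
for _ in range(4):
    slots.append(next_slot(slots[-1], 4))
```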

Figure 3 gives the formal specification of the TTP membership algorithm.

The protocol is composed of two inputs and N processes P[i] running in parallel.
The inputs of the protocol are the parameter N and a boolean fault which is true
in case a fault occurs. Each station P[i] is a non-deterministic finite state
machine extended with local variables. The local variables are the counters A and
F for CAcc and CFail, the membership vector m, the variable t to keep track of the
time slots, and the variable s to remember the identity of the station which is
sending when a fault occurs. Following (Manna and Pnueli 1995), all local
variables are marked with the identity of the station they belong to, which is
denoted by i.

The parallel composition is synchronous and the automata synchronize on the
broadcast message emit.

There are four locations l_in, l_0, l_1 and l_F. Location l_in is the initial
location or state. When a fault occurs, stations that recognize the fault move to
state l_0 while stations that do not recognize the fault move to l_1. Stations
that are prevented from sending move to location l_F.

    in N : int, N > 1          in fault : boolean signal

    ||_{i=1..N} P[i], where P[i] is the following program:

    A[i] : int, init N + 1 − i
    m[i] : array of boolean, init 1
    s[i] : int, init 0    // identity of the faulty station
    F[i] : int, init 0
    t[i] : int, init 1    // t for turn

    loop forever do

    l_in :  t[i] = i ∧ ¬fault
                → emit!!, t[i]⊕⊕, A[i] := 1; l_in
         [] t[i] = i ∧ fault
                → emit!!, t[i]⊕⊕, A[i] := 1, s[i] := t[i]; l_1
         [] t[i] ≠ i ∧ emit?? ∧ ¬fault
                → t[i]⊕⊕, A[i]++; l_in
         [] t[i] ≠ i ∧ emit?? ∧ fault
                → t[i]⊕⊕, A[i]++, s[i] := t[i]; l_1
         [] t[i] ≠ i ∧ emit?? ∧ fault
                → t[i]⊕⊕, F[i]++, s[i] := t[i], m[i][t[i]] := 0; l_0

    l_0 :   t[i] = i ∧ A[i] > F[i]
                → emit!!, t[i]⊕⊕, A[i] := 1, F[i] := 0; l_0
         [] t[i] = i ∧ A[i] < F[i]
                → ¬emit!!, t[i]⊕⊕; l_F
         [] t[i] ≠ i ∧ emit?? ∧ m[t[i]] = m[i]
                → t[i]⊕⊕, A[i]++; l_0
         [] t[i] ≠ i ∧ emit?? ∧ m[t[i]] ≠ m[i]
                → t[i]⊕⊕, F[i]++, m[i][t[i]] := 0; l_0
         [] t[i] ≠ i ∧ ¬emit??
                → t[i]⊕⊕, m[i][t[i]] := 0; l_0

    l_1 :   t[i] = i ∧ A[i] > F[i]
                → emit!!, t[i]⊕⊕, A[i] := 1, F[i] := 0; l_1
         [] t[i] = i ∧ A[i] < F[i]
                → ¬emit!!, t[i]⊕⊕; l_F
         [] t[i] ≠ i ∧ emit?? ∧ i = s[i] ∧ t[i] = s[i] + 2 ∧
            (m[t[i]][s[i]] = m[t[i] − 1][s[i]] = 0)
                → t[i]⊕⊕; l_F
         [] t[i] ≠ i ∧ emit?? ∧ i = s[i] ∧ t[i] = s[i] + 2 ∧
            ¬(m[t[i]][s[i]] = m[t[i] − 1][s[i]] = 0)
                → t[i]⊕⊕, A[i]++; l_1
         [] t[i] ≠ i ∧ emit?? ∧ (i ≠ s[i] ∨ t[i] ≠ s[i] + 2) ∧ m[t[i]] = m[i]
                → t[i]⊕⊕, A[i]++; l_1
         [] t[i] ≠ i ∧ emit?? ∧ (i ≠ s[i] ∨ t[i] ≠ s[i] + 2) ∧ m[t[i]] ≠ m[i]
                → t[i]⊕⊕, F[i]++, m[i][t[i]] := 0; l_1
         [] t[i] ≠ i ∧ ¬emit??
                → t[i]⊕⊕, m[i][t[i]] := 0; l_1

    l_F :   t[i] = i
                → ¬emit!!, t[i]⊕⊕; l_F
         [] t[i] ≠ i
                → t[i]⊕⊕; l_F

    end loop

Fig. 3. The algorithm with N stations and 1 fault, M1.

Let us go through all transitions of location l_in in detail. In the initial state
l_in, all stations have the same membership vectors. First, suppose that there is
no fault. A station can always emit when it is its turn to do so, expressed in the
model by the condition t[i] = i. The first transition has t[i] = i ∧ ¬fault as
guard. The action performed by that transition is: emit!!, t[i]⊕⊕, A[i] := 1;
l_in. Put informally, when it is its turn to emit and there is no fault, a station
sends a frame, increments the variable t by 1 modulo N, resets the counter A to 1
and, then, remains in state l_in. The third transition models the behaviour of a
receiving station when there is no fault. In that case, it always recognizes as
correct frames sent by other stations. This is expressed in the model by the guard
t[i] ≠ i ∧ emit?? ∧ ¬fault. These two guards, from the first and third transition,
for two different processes i and j are true simultaneously. Indeed, in the
synchronous model of computation, a computational step 'takes no time'
(Benveniste and Berry 1991), therefore the emission of emit by process i is
synchronous with its reception by all other processes. When the parallel statement
is finished, all stations have incremented t by 1, the emitting station has reset
A to 1 while other stations have incremented A. Counter F stays at its initial
value since it is not incremented in transitions containing the condition ¬fault.

Suppose now that there is a fault, which is modeled by the condition fault.
For an emitting station, this is the second transition. The action taken by the
emitting station is similar to the first transition, except that it records its
identity with s[i] := t[i] and then moves to location l_1 since, by the confidence
principle, a station never thinks of itself as faulty (rather receiving stations
are faulty). For a receiving station the guard t[i] ≠ i ∧ emit?? ∧ fault is true.
The occurrence of an asymmetric fault is modeled by a non-deterministic choice
represented by the fourth and fifth transitions. The information of whether a
station recognizes the fault is recorded in the control via locations l_0 and l_1.
If a receiving station recognizes the fault, it increases F by 1, F[i]++,
memorizes the identity of the faulty station, s[i] := t[i], puts the membership
bit of the sending station to 0, m[i][t[i]] := 0, and moves to l_0 (fifth
transition). If a station does not recognize the fault, it increases A by 1,
A[i]++, and moves to l_1.
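
The five transitions of the initial location can be read as one synchronous step per station. The following sketch is a hypothetical Python rendering of that reading, not the specification itself: the class Station, the function step_l_in and the boolean recognizes (which resolves the non-deterministic choice of a receiver) are our names, and right-hand sides are evaluated on the values held before the step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Station:
    i: int                      # identity of the station, in 1..n
    n: int                      # number of stations N
    loc: str = "l_in"
    A: int = 0                  # CAcc, initialised to N + 1 - i below
    F: int = 0                  # CFail
    t: int = 1                  # current time slot
    s: int = 0                  # identity of the faulty station
    m: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.A = self.n + 1 - self.i
        self.m = [True] * (self.n + 1)   # index 0 is unused


def step_l_in(st: Station, fault: bool, recognizes: bool) -> bool:
    """Apply the enabled transition of l_in; return True if st emitted."""
    sender = st.t
    emits = sender == st.i
    if emits:
        st.A = 1
        if fault:                        # second transition
            st.s, st.loc = sender, "l_1"
    elif not fault:                      # third transition
        st.A += 1
    elif recognizes:                     # fifth transition
        st.F += 1
        st.s = sender
        st.m[sender] = False
        st.loc = "l_0"
    else:                                # fourth transition
        st.A += 1
        st.s, st.loc = sender, "l_1"
    st.t = sender % st.n + 1             # t ⊕ 1 on slots 1..n
    return emits


stations = [Station(i=k, n=3) for k in (1, 2, 3)]
for st in stations:                      # slot 1: no fault
    step_l_in(st, fault=False, recognizes=False)
for st in stations:                      # slot 2: fault seen by station 1 only
    step_l_in(st, fault=True, recognizes=(st.i == 1))
```

In this run station 2 is the faulty sender; station 1 moves to l_0 and station 3, which does not notice the fault, joins station 2 in l_1.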

In location l_0, a station behaves as follows. In its time slot, t[i] = i, either
it passes the clique avoidance mechanism, A[i] > F[i] (first transition), and
emits, emit!!, resets its counters, A[i] := 1, F[i] := 0, increments the time
slot, t[i]⊕⊕, and stays in l_0, or the clique avoidance mechanism fails,
A[i] < F[i] (second transition), it cannot emit, ¬emit!!, and goes to the fail
state l_F. Outside its time slot, t[i] ≠ i, if a message has been broadcast,
emit??, and it has the same membership vector as the sending station,
m[t[i]] = m[i] (third transition), it increases the counter of accepted messages,
A[i]++, increments the time slot, t[i]⊕⊕, and stays in l_0; if it does not have
the same membership vector as the sending station, m[t[i]] ≠ m[i] (fourth
transition), it increases the counter of failed messages, F[i]++, puts the
membership bit of the sending station to 0, m[i][t[i]] := 0, increments the time
slot, t[i]⊕⊕, and stays in l_0. Finally, if no message has been broadcast,
¬emit?? (fifth transition), it puts the membership bit of the station that failed
sending to 0, m[i][t[i]] := 0, and stays in l_0.

In location l_1, transitions are similar except that two more transitions are
added to cover the result of CheckI and CheckII using Propositions 6 and 7. These
are the third and the fourth transitions. If the sending station is the second
successor of the faulty station, t[i] = s[i] + 2, and the two successors of the
faulty station have the membership bit of s[i] at 0,
m[t[i]][s[i]] = m[t[i] − 1][s[i]] = 0, then station s[i] moves to l_F, otherwise
station s[i] stays in l_1.

In location l_F, stations only keep track of the time slot; they cannot send and
stay there.

4.3 Construction of an abstract model

We show in this section the construction of a counter automaton which is an ab- 
stract model of the parametrized membership algorithm for an arbitrary number of 
components N. In order to simplify the presentation and the proof of the abstrac- 
tion, we present this construction in several basic steps. The aim of the first steps is 
to encode the infinite-data-domain local variables of the N components with a finite 
number of global variables (counters). Then, the last step is a counter abstraction 
which encodes the control configurations (for N components) with global variables 
counting the number of components at each control location. 

In the sequel, we give the different abstraction steps by defining each time the 
abstract model and by showing that it (bi)simulates the previous one. 

4.3.1 Eliminating identical locals t and s

The first transformation we perform is to replace the N local variables t[i] and
s[i] by two global counters t_G and s_G. It relies on the following fact:

Lemma 4.1

The two following properties are invariants of the program M1:

1. ∀i, j ∈ {1 ... N} : (t[i] = t[j] ∧ s[i] = s[j])

2. ∃!i ∈ {1 ... N} : t[i] = i

In other words, at any computation step, all processes P[i] have identical values
for locals s[i] and t[i] and there is exactly one process whose identity is t[i].
We define a program M2 where these local variables are replaced by t_G and s_G in
the following manner. We modify the transitions in order to encode the updates of
the local variables as updates of the global ones. Since all processes are
synchronous, these updates must be done by exactly one component. We choose that
this will be done by the component for whom it is the turn to emit, i.e., who
satisfies t[i] = i.

So, the program M2 is obtained from M1 by applying the following transformations:
(1) initialize t_G and s_G to 1 and 0 respectively, (2) replace each occurrence
of t[i] in the guards by t_G, and (3) in all the guarded commands, if t[i] = i
appears in the guard, then replace t[i] by t_G (resp. s[i] by s_G), else remove
the update statements of t[i] and s[i].

For example the second transition at location l_in, see Figure 3, is transformed
into

t_G = i ∧ fault → emit!!, t_G⊕⊕, A[i] := 1, s_G := t_G; l_1

We establish now that M2 bisimulates M1. For that, let us consider the relation
α_{1,2} between states of M1 and M2 such that, for every state σ_1 of M1 and
every state σ_2 of M2, we have α_{1,2}(σ_1, σ_2) if and only if (1) σ_2 and σ_1
coincide on all locations and variables different from the locals t and s, and the
globals t_G and s_G, and (2) σ_2(t_G) = σ_1(t[i]) and σ_2(s_G) = σ_1(s[i]) for
every i ∈ {1, ..., N}, i.e., the value of t_G and s_G in σ_2 is the same as the
value of t[i], respectively s[i], in σ_1. Then, it can easily be checked that the
relation α_{1,2} is a bisimulation between the transition systems of M1 and M2.

4.3.2 Eliminating locals A and F

Based on Theorem 8, we define a new model where local variables A and F are
simulated by global counters d_0, d_1, C_0, and C_1; the counters C_0 and C_1
stand for |S_0| and |S_1| respectively. We need in addition a variable r to count
the current round and a variable C_p which counts the number of steps performed in
the current round.

We define hereafter a program M3 obtained from M2 by the following
transformations: (1) in transitions starting at location l_in where the fault is
detected (fault appears in the guard), replace A[i] and F[i] by C_1 and C_0
respectively; (2) each transition starting from locations l_0 and l_1 which
corresponds to an emit action (where t_G = i appears in the guard) is duplicated
into three transitions corresponding to the cases where the execution is inside
the first round, is inside some later round, or is precisely at the beginning of a
new round. The comparisons of A and F in the guards and their updates are replaced
by corresponding comparisons and updates on C_0, C_1, d_0, and d_1, according to
Theorem 8; (3) all other statements involving A and F are removed.

For example the fourth transition at location l_in:

t_G ≠ i ∧ emit?? ∧ fault → t_G⊕⊕, A[i] := 1; l_1

is transformed into

t_G ≠ i ∧ emit?? ∧ fault → t_G⊕⊕, C_1++; l_1.

The second transition at location l_0:

t_G = i ∧ A[i] < F[i] → ¬emit!!, t_G⊕⊕; l_F

is duplicated into three transitions λ_1, λ_2, λ_3 as follows:

t_G = i ∧ C_0 + C_1 < 2 × d_1 ∧ C_p < N ∧ r = 0 → ¬emit!!, t_G⊕⊕, C_p++, C_0−−; l_F
t_G = i ∧ C_0 < C_1 ∧ C_p = N → ¬emit!!, t_G⊕⊕, C_p := 1, r++, C_0−−; l_F
t_G = i ∧ C_0 < C_1 ∧ C_p < N ∧ r > 0 → ¬emit!!, t_G⊕⊕, C_p++, C_0−−; l_F

The obtained program M3 is bisimilar to M2 by Theorem 8.
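
The replacement of the local test by a test on global counters can be illustrated as follows. This is a sketch under the assumption that |W| = C_0 + C_1 and that the local counters have the values stated in Theorem 8; the function names are ours, and only the strict test "accepted > failed" is compared.

```python
def local_counters(loc, first_round, c0, c1, d0, d1):
    """(CAcc, CFail) of a station at l_0 or l_1, as stated in Theorem 8."""
    if first_round:
        other = d1 if loc == "l_0" else d0   # senders of the other group
        return c0 + c1 - other, other
    return (c0, c1) if loc == "l_0" else (c1, c0)


def global_guard(loc, first_round, c0, c1, d0, d1):
    """Emit guard on the global counters, without any local variable."""
    if first_round:
        other = d1 if loc == "l_0" else d0
        return c0 + c1 > 2 * other
    return c0 > c1 if loc == "l_0" else c1 > c0


def guards_agree(limit=6):
    """Compare the local test A > F with the global guard on small values."""
    for loc in ("l_0", "l_1"):
        for first_round in (True, False):
            for c0 in range(limit):
                for c1 in range(limit):
                    for d0 in range(c0 + 1):
                        for d1 in range(c1 + 1):
                            acc, fail = local_counters(loc, first_round,
                                                       c0, c1, d0, d1)
                            if (acc > fail) != global_guard(loc, first_round,
                                                            c0, c1, d0, d1):
                                return False
    return True
```

For a station at l_0 in the first round, A > F reads C_0 + C_1 − d_1 > d_1, which is the guard C_0 + C_1 > 2 × d_1 used for that location.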

4.3.3 Eliminating the local m

At this stage, we can see that the information given by m is not relevant anymore
except in the case where a faulty station must leave the active state because its
first and second successors have recognized it as faulty. To deal with this case,
we replace the test on the vector m by the faulty station with a guess of the
diagnostic of its two immediate successors (CheckI and CheckII). This guess is
made exactly one round after the fault. For that, we simply perform a
nondeterministic choice whether to leave or to stay in the active state. It turns
out that this abstraction is precise enough for our purpose.

Then, we define a new program M4 obtained from the program M3 by (1) removing the
transitions starting from l_1 involving tests on the membership bit vectors of the
two immediate successors, (2) duplicating the transitions from l_1 for the case
(C_p = N) ∧ (r = 0) into two transitions corresponding to the actions of leaving
or staying in the active state, with g to fix the choice, and (3) removing all the
remaining statements involving m.

Moreover, after the transformation above, guarded commands starting at l_0 and
l_1 which involve the test t_G ≠ i can actually be compacted into trivial self
loops. This leads to the program M4 given in Figure 4. One shows that this program
simulates the program M3 by induction.

4.3.4 Strengthening guards

Before the counter abstraction step where identities of processes will be lost, we
need to strengthen the guards using some invariants of the system.

Lemma 4.2

The following statements are invariants of the program M4:

1. ∀i ((l_0[i] ∧ t_G = i ∧ C_0 + C_1 > 2 × d_1 ∧ C_p < N ∧ r = 0) ⇒ d_0 < C_0)

2. ∀i ((l_0[i] ∧ t_G = i ∧ C_0 + C_1 < 2 × d_1 ∧ C_p < N ∧ r = 0) ⇒ d_0 < C_0)

3. ∀i ((l_1[i] ∧ t_G = i ∧ C_0 + C_1 > 2 × d_0 ∧ C_p < N ∧ r = 0) ⇒ d_1 < C_1)

4. ∀i ((l_1[i] ∧ t_G = i ∧ C_0 + C_1 < 2 × d_0 ∧ C_p < N ∧ r = 0) ⇒ d_1 < C_1)

Invariant (1) in the lemma above says that when process P[i] is at location l_0,
if it is the turn of this process to emit, and if it is allowed to emit in the
round following the fault, then not all stations from the set S_0 have emitted in
that round. This is due to the fact that d_0 counts stations from the set S_0 that
have emitted in the current round.

Thus, at location l_0, the guard t_G = i ∧ C_0 + C_1 > 2 × d_1 ∧ C_p < N ∧ r = 0
can be strengthened into
t_G = i ∧ C_0 + C_1 > 2 × d_1 ∧ C_p < N ∧ r = 0 ∧ d_0 < C_0
without changing the semantics of the program. We do similar transformations using
the other invariants.

Further, we update d_0 and d_1 in other transitions at l_0 and l_1 so that they
keep counting the number of stations from the sets S_0 and S_1 respectively that
have emitted in subsequent rounds. In that way, we obtain more invariants similar
to the ones given in Lemma 4.2 and we use them to strengthen guards without
changing the semantics of the program.

For example, at l_0 transition

t_G = i ∧ C_0 > C_1 ∧ C_p = N → emit!!, t_G⊕⊕, C_p := 1, r++; l_0

is changed into

t_G = i ∧ C_0 > C_1 ∧ C_p = N → emit!!, t_G⊕⊕, d_0 := 0, C_p := 1, r++; l_0

and transition

t_G = i ∧ C_0 > C_1 ∧ C_p < N ∧ r > 0 → emit!!, t_G⊕⊕, C_p++; l_0

is changed into

t_G = i ∧ C_0 > C_1 ∧ C_p < N ∧ r > 0 → emit!!, t_G⊕⊕, C_p++, d_0++; l_0

It can be shown that

∀i ((l_0[i] ∧ t_G = i ∧ C_0 > C_1 ∧ C_p = N) ⇒ d_0 = C_0)

is an invariant and therefore the guard can be strengthened into

t_G = i ∧ C_0 > C_1 ∧ C_p = N ∧ d_0 = C_0.

    in N     : int, N > 1
    in fault : bool. signal
    in g     : bool. signal

    C_0 : int, init 0        t_G : int, init 1
    C_1 : int, init 1        s_G : int, init 0
    C_p : int, init 1        d_1 : int, init 1
    r   : int, init 0        d_0 : int, init 0

    ||_{i=1..N} P[i], where P[i] is the following program:

    loop forever do

    l_in :  t_G = i ∧ ¬fault
                → emit!!, t_G⊕⊕; l_in
         [] t_G = i ∧ fault
                → emit!!, t_G⊕⊕, s_G := t_G; l_1
         [] t_G ≠ i ∧ emit?? ∧ ¬fault
                → ; l_in
         [] t_G ≠ i ∧ emit?? ∧ fault
                → C_1++; l_1
         [] t_G ≠ i ∧ emit?? ∧ fault
                → C_0++; l_0

    l_0 :   t_G = i ∧ C_0 + C_1 > 2 × d_1 ∧ C_p < N ∧ r = 0
                → emit!!, t_G⊕⊕, C_p++, d_0++; l_0
         [] t_G = i ∧ C_0 > C_1 ∧ C_p = N
                → emit!!, t_G⊕⊕, C_p := 1, r++; l_0
         [] t_G = i ∧ C_0 > C_1 ∧ C_p < N ∧ r > 0
                → emit!!, t_G⊕⊕, C_p++; l_0
         [] t_G = i ∧ C_0 + C_1 < 2 × d_1 ∧ C_p < N ∧ r = 0
                → ¬emit!!, t_G⊕⊕, C_p++, C_0−−; l_F
         [] t_G = i ∧ C_0 < C_1 ∧ C_p = N
                → ¬emit!!, t_G⊕⊕, C_p := 1, r++, C_0−−; l_F
         [] t_G = i ∧ C_0 < C_1 ∧ C_p < N ∧ r > 0
                → ¬emit!!, t_G⊕⊕, C_p++, C_0−−; l_F
         [] t_G ≠ i
                → ; l_0

    l_1 :   t_G = i ∧ C_0 + C_1 > 2 × d_0 ∧ C_p < N ∧ r = 0
                → emit!!, t_G⊕⊕, C_p++, d_1++; l_1
         [] t_G = i ∧ C_1 > C_0 ∧ C_p = N ∧ r ≠ 0
                → emit!!, t_G⊕⊕, C_p := 1, r++; l_1
         [] t_G = i ∧ C_1 > C_0 ∧ C_p = N ∧ r = 0 ∧ ¬g
                → emit!!, t_G⊕⊕, C_p := 1, r++; l_1
         [] t_G = i ∧ C_1 > C_0 ∧ C_p = N ∧ r = 0 ∧ g
                → ¬emit!!, t_G⊕⊕, C_p := 1, r++, C_1−−; l_F
         [] t_G = i ∧ C_1 > C_0 ∧ C_p < N ∧ r > 0
                → emit!!, t_G⊕⊕, C_p++; l_1
         [] t_G = i ∧ C_0 + C_1 < 2 × d_0 ∧ C_p < N ∧ r = 0
                → ¬emit!!, t_G⊕⊕, C_p++, C_1−−; l_F
         [] t_G = i ∧ C_1 < C_0 ∧ C_p = N
                → ¬emit!!, t_G⊕⊕, C_p := 1, r++, C_1−−; l_F
         [] t_G = i ∧ C_1 < C_0 ∧ C_p < N ∧ r > 0
                → ¬emit!!, t_G⊕⊕, C_p++, C_1−−; l_F
         [] t_G ≠ i
                → ; l_1

    l_F :   t_G = i ∧ C_p < N
                → ¬emit!!, t_G⊕⊕, C_p++; l_F
         [] t_G = i ∧ C_p = N
                → ¬emit!!, t_G⊕⊕, C_p := 1, r++; l_F
         [] t_G ≠ i
                → ; l_F

    end loop

Fig. 4. Eliminating all locals, M4.

Similarly to the transitions starting from $l_0$ and $l_1$, we need to strengthen the
guards of the transitions starting from $l_F$. For that, we introduce two supplementary
counters $d_F$ and $C_F$ (initial value 0), which play a similar role for location $l_F$
as counters $d_0$ and $C_0$ do for location $l_0$. We use invariants similar to those
given in the lemma above to strengthen all guards at $l_0$, $l_1$ and $l_F$.

The last guard strengthening we perform makes use of the proposition stating
that, if the faulty station leaves the active state because of its first and second
successors, then the two stations following it do not belong to $S_1$, but to $S_0$. In
other words, a station from $S_1$ emitting in the first round following the fault with
$g$ cannot be the first nor the second station following the faulty station. Therefore,
at location $l_1$, the guard of transition

$t_G = i \land C_1 > C_0 \land C_p < N \land r > 0 \;\to\; \mathit{emit}!!,\ t_G^{\oplus},\ C_p{+}{+};\ l_1$

is strengthened into

$t_G = i \land C_1 > C_0 \land C_p < N \land r > 0 \land (r \neq 1 \lor \neg g \lor C_p \neq 1 \lor C_p \neq 2)$

without changing the semantics of the program.

We obtain a transformed program M5 bisimilar to program M4.

4.3.5 Counter abstraction

We are now ready to perform the usual counter abstraction, where the individual control
locations $l_{in}$, $l_0$, $l_1$ and $l_F$ are replaced by counters $C_{in}$, $C_0$, $C_1$
and $C_F$ respectively, counting how many processes are at these locations, and obtain a
single extended finite state machine with a single location, which is then omitted
(Delzanno 2000).

Then, we define a new program M6 obtained from the program M5 by (1)
transforming each transition $l : g \to a;\ l'$ into $C_l > 0 \land g \to a,\ C_l{-}{-},\ C_{l'}{+}{+}$,
where $l, l'$ are locations or control states and $C_l, C_{l'}$ the associated counters,
(2) replacing any condition involving $i$ by true, (3) packing into a single abstract
transition all transitions from M5 whose guards evaluate to true simultaneously.
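
Rule (1) above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the input format of any verification tool: the names `Transition` and `abstract_step`, and the dictionary encoding of counter valuations, are hypothetical, and the guard is assumed to be the one obtained after step (2), i.e. with every condition on $i$ already replaced by true.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

State = Dict[str, int]  # counter valuation (hypothetical encoding)


@dataclass
class Transition:
    source: str                       # counter of location l
    target: str                       # counter of location l'
    guard: Callable[[State], bool]    # guard g
    action: Callable[[State], None]   # action a, applied in place to a copy


def abstract_step(state: State, t: Transition) -> Optional[State]:
    """Fire  C_l > 0 and g  ->  a, C_l--, C_l'++ ; return None if disabled."""
    if state[t.source] <= 0 or not t.guard(state):
        return None
    nxt = dict(state)          # leave the input valuation untouched
    t.action(nxt)
    nxt[t.source] -= 1         # one process leaves location l
    nxt[t.target] += 1         # and enters location l'
    return nxt


# Example: one process moves from l_in to l_1 while the slot counter advances.
example = abstract_step(
    {"C_in": 2, "C_1": 0, "Cp": 1},
    Transition("C_in", "C_1", lambda s: True,
               lambda s: s.update(Cp=s["Cp"] + 1)),
)
```

A self-loop (source equal to target) leaves the location counters unchanged, which matches the abstract transition obtained for transitions that stay at the same location.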

For example, the two transitions of M5 at $l_{in}$

$t_G = i \land \neg \mathit{fault} \;\to\; \mathit{emit}!!,\ t_G^{\oplus};\ l_{in}$

and

$t_G \neq i \land \mathit{emit}?? \land \neg \mathit{fault} \;\to\; ;\ l_{in}$

give the abstract transition:

$C_{in} > 0 \land \mathit{emit}?? \land \neg \mathit{fault} \;\to\; \mathit{emit}!!,\ t_G^{\oplus}$.

We obtain program M6 shown in Figure 5. The initial value of $x$ guesses the number
of processes that move from location $l_{in}$ to $l_1$.

Such a counter abstraction simulates the previous models and satisfies the property
that the value of any counter $C_l$ at some state $a'$ is exactly the number of
processes at location $l$ in the corresponding state $a$.

Thus, if we show that $C_1$ or $C_0$ evaluates to 0 in the program given in Figure 5,
we know that set $S_0$ or $S_1$ is empty in the concrete model, establishing that there
is no clique.

4.4 Proving properties

We have used the system given in Figure 5, with a slight change concerning rounds,
to prove properties of the protocol automatically. We have used an automaton
where rounds are represented by states, rather than by the variable $r$. We make
the distinction between the first round after the fault and later rounds. Condition
$r = 0$ is equivalent to the automaton being in state round1, while $r > 0$ is equivalent
to the automaton being in state later round. Further, the automaton leaves state
later round to go to a state normal when cliques are stable.

A first property, called P1, that has been proved true, is that at the end of
the first round after the fault:

!(C1 = C0) (P1).

P1 means that, when the first round after the fault is over, either $|S_1| > |S_0|$ or
$|S_0| > |S_1|$, whatever the original partition $\{S_1, S_0\}$ was when the fault occurred.

We have analyzed what leads to $C_1 > C_0$, or $C_0 > C_1$, after one round.

First, we have shown that, if $|S_1| > |S_0|$ when the fault occurs, then $C_1 > C_0$
after one round, and vice versa. Adding the constraint $x > N - x$, we have proved
that, at the end of the first round after the fault:

(x = C1) (P2).

Since counters $C_1$ and $C_0$ may not increase, this implies $C_1 > C_0$ when $r > 0$. It
also implies that all stations from $S_1$ did send in the first round.

Then we have investigated the case $|S_1| = |S_0|$ when the fault occurs. If set $S_1$
comes first in the static order, then $C_1 > C_0$, and vice versa if $S_0$ comes first.
Adding the constraint $x = N - x$, we have proved:

AG ((d1 = x and d0 < x) => AG (C1 = x)) (P3),

AG ((d1 = x and d0 < x) => (C1 + C0 - 2 * d1 <= 0)) (P4).

Again, this implies $C_1 > C_0$ after the first round. It also implies that all stations
from $S_1$ did send in the first round.

Fig. 5. The abstract model M6.



To prove the main property, we have first shown that, if $r > 0$, when one round
is completed, it is not possible to start a new round where both $C_1$ and $C_0$ are not
0, i.e., where $S_1$ and $S_0$ are both non-empty.

Indeed, the property below is true when $r > 0$:

AG ¬(¬(C1 = 0) and ¬(C0 = 0) and (Cp = N)) (P6).

Finally, we proved that, at the end of the second round after the fault, i.e., when
$C_p = N$ and the automaton goes to state normal:

AG (C1 = 0 or C0 = 0) (P7).

P7 means that either $S_1$ or $S_0$ is empty at the end of the second round. Hence, all
active stations have the same membership vectors at the end of the second round
and again form a single clique in the graph-theoretical sense.

5 Automatic Verification: the k Faults Case 

Being able to calculate $CAcc_s$ and $CFail_s$ precisely using global counters only
(see Theorem 5) is what makes the construction of an abstract model possible in
the 1-fault case. For the $k$-faults case, the same approach can be taken. Provided
that one is able to calculate $CAcc_s$ and $CFail_s$ precisely using global counters, first
a behavioural model for a scenario of the $k$-fault case is constructed, then this
behavioural model can be transformed into successive models that simulate each
other, replacing local variables with global counters, until the final counter abstraction
is obtained and verified automatically, as has been done for the 1-fault case. Thus,
what is needed is to establish a generalisation of Theorem 5 for the $k$-faults case.

5.1 Calculating $CAcc_s$ and $CFail_s$

Let $1 \le i \le k$. By Proposition 3, after the occurrence of fault $i$, $W$, the set of active
stations, is partitioned into sets $S_w$ with $w \in \{0,1\}^i$. We find it handy in what
follows to indicate the length of the string $w$ with the superscript $i$. We associate
two counters $C_{w^i}$ and $d_{w^i}$ to each set $S_{w^i}$ that is formed after the occurrence
of any fault $i$. The counter $C_{w^i}$ counts how many stations belong to set $S_{w^i}$
when fault $i$ occurs. The counter $d_{w^i}$ counts how many stations from the set $S_{w^i}$
have sent between fault $i$ and fault $i + 1$ in case $i < k$, and counts how many
stations from the set $S_{w^i}$ have sent so far in the first round following fault $k$ in
case $i = k$. Again because of Proposition 3, we assume that, for any $w \in \{0,1\}^{i-1}$,
$C_{w^{i-1}1} + C_{w^{i-1}0} = C_{w^{i-1}}$, $C_{w^{i-1}1} \ge 1$ and $d_{w^{i-1}1} \ge 1$. Further, for each fault $i$,
we associate a counter $C_p(i)$ that counts how many time slots have elapsed since
fault $i$.

The creation of counters is illustrated taking a particular scenario in Figure 6.
After 1 fault, active stations split into two sets: $S_0$, the stations that have not
received the frame correctly, and $S_1$, the stations that have received the frame
correctly. Four counters are created: $C_{0^1}$ contains the number of stations from $S_0$,
$C_{1^1}$ contains the number of stations from $S_1$, $d_{0^1}$ counts the stations from $S_0$
that are sending frames, and $d_{1^1}$ counts the stations from $S_1$ that are sending frames;
$d_{1^1} \ge 1$ since the station that was sending when the fault occurred is from $S_1$.
Each time a station from $S_1$, respectively from $S_0$, is prevented from sending, $S_1$
and $C_{1^1}$, respectively $S_0$ and $C_{0^1}$, are decreased by 1. Suppose that a second fault
occurs when some station from $S_1$ emits. Then $S_1$ splits into $S_{10}$ and $S_{11}$ and six
new counters are created: $C_{00^2}$, which is, in this scenario, initially equal to the
present value of $C_{0^1}$; $C_{10^2}$, to count the number of stations in $S_{10}$; $C_{11^2}$, to count
the number of stations in $S_{11}$; $d_{00^2}$, to count the number of stations from $S_{00}$ (which
is, in this scenario, the same as $S_0$) that are sending frames after the second fault
(note that $d_{0^1}$ is not incremented anymore after the occurrence of fault 2); $d_{10^2}$, to
count the number of stations from $S_{10}$ that are sending frames after the second fault;
and $d_{11^2}$, to count the number of stations from $S_{11}$ that are sending frames after the
second fault. Again, $d_{11^2} \ge 1$ since the station that was sending when the fault
occurred is from $S_{11}$. In Figure 6, one assumes that a third fault occurs when a station
from set $S_{00}$ is sending and a fourth fault occurs when a station from set $S_{100}$ is
sending.

fault 1, $C_p(1)$: $C_{1^1}, d_{1^1}$; $C_{0^1}, d_{0^1}$
fault 2, $C_p(2)$: $C_{11^2}, d_{11^2}$; $C_{10^2}, d_{10^2}$; $C_{00^2}, d_{00^2}$
fault 3, $C_p(3)$: $C_{110^3}, d_{110^3}$; $C_{100^3}, d_{100^3}$; $C_{001^3}, d_{001^3}$; $C_{000^3}, d_{000^3}$
fault 4, $C_p(4)$: $C_{1100^4}, d_{1100^4}$; $C_{1001^4}, d_{1001^4}$; $C_{1000^4}, d_{1000^4}$; $C_{0010^4}, d_{0010^4}$; $C_{0000^4}, d_{0000^4}$

Fig. 6. Illustrating the creation of counters up to 4 faults.

Fig. 7. Evaluating $CAcc_s$ and $CFail_s$ after fault $k$.

Thus, for $k$ faults and a particular scenario, $\sum_{i=1}^{k} 2(i + 1) + k$ counters have
been created so far.

These counters are almost enough to know $CAcc_s$ and $CFail_s$ for any station
$s$ in the first round following fault $k$. Indeed, let $s$ be a station ready to send; $s$
belongs to some set $S_{w^k}$. In the rounds preceding fault $k$ and during the round
following fault $k$, $s$ recognizes as correct the frames sent by stations from $S_{w'}$, where
$w'$ is a prefix of $w^k$, and recognizes as incorrect all other frames. This information
is recorded with the counters $d_{w^i}$, $w \in \{0,1\}^i$ and $1 \le i \le k$.

There is one more subtlety. The clique avoidance mechanism requires that $CAcc_s$
and $CFail_s$ count one round only, the round being relative to the position of the
sending station $s$. To do so properly, we distinguish two cases.

The first case is when fault $k$ occurs in the first round following fault 1 and the
time slot of $s$ still belongs to that round. One must take into account that $s$ has
recognized as correct all stations that have sent before fault 1, which is a
straightforward generalization of Theorem 5. For example, consider the scenario of
Figure 6 and a station $s$ from set $S_{0010}$. Then:

$CAcc_s = |W| - d_{1^1} - d_{11^2} - d_{10^2} - d_{110^3} - d_{100^3} - d_{000^3} - d_{1100^4} - d_{1001^4} - d_{1000^4} - d_{0000^4}$,
$CFail_s = d_{1^1} + d_{11^2} + d_{10^2} + d_{110^3} + d_{100^3} + d_{000^3} + d_{1100^4} + d_{1001^4} + d_{1000^4} + d_{0000^4}$.
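
The bookkeeping of this first case can be sketched in Python as follows. The encoding is an assumption made for illustration: sets are named by their binary strings, the set of the sending station splits into `w1` and `w0` while every other set gets a `0` appended (as read off the labels of Figure 6), counters $d_{w^i}$ are stored in a dictionary keyed by `w`, and all function names are hypothetical.

```python
def split_sets(sets, sender_set):
    """Partition after one more fault: the sender's set w becomes w1 and w0,
    every other set w is relabelled w0."""
    if sender_set not in sets:
        raise ValueError("sender set is not part of the current partition")
    out = []
    for w in sets:
        out.extend([w + "1", w + "0"] if w == sender_set else [w + "0"])
    return out


def scenario_levels(senders):
    """senders[i-1] is the label of the set containing the station that is
    sending when fault i occurs; returns the partition after each fault."""
    levels, sets = [], [""]          # "" stands for W before any fault
    for w in senders:
        sets = split_sets(sets, w)
        levels.append(sets)
    return levels


def rejected_labels(levels, w_k):
    """All labels, at every level, that are not a prefix of w_k."""
    return [w for level in levels for w in level if not w_k.startswith(w)]


def cacc_cfail_first_case(size_w, d, levels, w_k):
    """CAcc_s and CFail_s when the slot of s is still in the round after fault 1."""
    cfail = sum(d[w] for w in rejected_labels(levels, w_k))
    return size_w - cfail, cfail


# Scenario of Figure 6: faults hit senders from W, S_1, S_00 and S_100.
FIG6_LEVELS = scenario_levels(["", "1", "00", "100"])
FIG6_REJECTED = rejected_labels(FIG6_LEVELS, "0010")
```

For a station of $S_{0010}$, `FIG6_REJECTED` lists exactly the ten counter labels that appear in the two sums above.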
The second case is when the time slot of the sending station $s$ lies between stations
$s^i$ and $s^{i+1}$ and fault $k$ occurs in the first round following fault $i$, $i > 1$. After fault $i$,
$s$ belongs to some set $S_{w^i}$ and the number of frames accepted as correct by $s$ is given
by $d_{w^i}$. However, to count $CAcc_s$ correctly, $d_{w^i}$ is too much. One has to withdraw
all stations accepted by $s$ whose time slots are between $s^i$ and $s$. This is illustrated
in Figure 7. We introduce auxiliary counters $d^A_{w^k}$ and $d^F_{w^k}$. These counters are
set to 0 when fault $k$ occurs. Counter $d^A_{w^k}$ counts how many stations from set
$S_{w^k}$ have sent so far, as counters $d_{w^k}$ do, and counter $d^F_{w^k}$ counts how many
stations from set $S_{w^k}$ were prevented from sending so far by the clique avoidance
mechanism and moved to the set of non-working stations. The difference with $d_{w^k}$
is that these counters are reset to 0 each time a counter $C_p(i)$ reaches $N$ after
fault $k$. Thus $d_{w^i} - \sum_{w'^k} d^A_{w'^k} - \sum_{w'^k} d^F_{w'^k}$, with $w^i$ a prefix of $w'^k$, gives exactly
how many frames between $s$ and $s^{i+1}$ the station $s$ has recognized as correct in
the round, and $d_{w^i} - \sum_{w'^k} d^A_{w'^k} - \sum_{w'^k} d^F_{w'^k} + d_{w^{i+1}} + \cdots + d_{w^k}$ gives exactly
how many frames in total $s$ has recognized as correct in the round, i.e., $CAcc_s$. For
example, consider again the example illustrated in Figure 6 and a station $s$ from
set $S_{0010}$. Suppose fault 4 occurs in the first round following fault 2 and station $s$
lies between $s^2$ and $s^3$, the stations that were emitting when fault 2, respectively
fault 3, occurred. Then:

$CAcc_s = d_{00^2} + d_{001^3} + d_{0010^4} - d^A_{0010^4} - d^A_{0000^4} - d^F_{0010^4} - d^F_{0000^4}$.
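
The computation of $CAcc_s$ in this second case can be sketched in Python as follows. This is an illustrative reading of the formula, not the authors' implementation: sets are named by binary strings, the dictionaries `d`, `d_a` and `d_f` (for the counters $d$, $d^A$ and $d^F$) and the function name are hypothetical, and the numeric values in the example are made up.

```python
def cacc_second_case(d, d_a, d_f, w_k, i, final_sets):
    """CAcc_s for a station s of S_{w_k} whose slot lies after the station
    that was sending at fault i, with k = len(w_k).

    d          -- counters d, keyed by set label (prefixes of w_k are used)
    d_a, d_f   -- auxiliary counters d^A and d^F, keyed by level-k labels
    final_sets -- labels of all sets existing after fault k
    """
    k = len(w_k)
    # Frames accepted by s since fault i: one counter per prefix of w_k.
    accepted = sum(d[w_k[:j]] for j in range(i, k + 1))
    # Level-k sets that extend the level-i prefix of w_k.
    extensions = [w for w in final_sets if w.startswith(w_k[:i])]
    # Withdraw the stations already counted in the current round.
    withdrawn = sum(d_a[w] + d_f[w] for w in extensions)
    return accepted - withdrawn


# Figure 6 scenario, station in S_0010, i = 2 (made-up counter values).
FINAL_SETS = ["1100", "1001", "1000", "0010", "0000"]
value = cacc_second_case(
    d={"00": 3, "001": 2, "0010": 1},
    d_a={"0010": 1, "0000": 0},
    d_f={"0010": 0, "0000": 1},
    w_k="0010",
    i=2,
    final_sets=FINAL_SETS,
)
```

With these inputs the function adds the counters for `00`, `001` and `0010` and subtracts the auxiliary counters of `0010` and `0000`, which is the shape of the example equation above.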

A similar idea works for $CFail_s$. This is formally stated in the proposition below.

Proposition 9

We indicate by $w_s$ that the string $w$ refers to an entity where $s$ belongs.

1. Let $s \in S_{w^k}$ be a station ready to send in the round following fault $k$.

(a) If $C_p(1) < N$ at the time slot of $s$, then:
$CAcc_s = |W| - \sum_{w'^i} d_{w'^i}$, and $CFail_s = \sum_{w'^i} d_{w'^i}$,
where $w'^i$ must not be a prefix of $w^k$.

(b) Let $1 < i < k$ such that $C_p(i) > N$ and $C_p(i + 1) < N$ at the
time slot of $s$. Then:

$CAcc_s = \big(\sum_{j=i}^{k} d_{w_s^j}\big) - \sum_{w'^k} d^A_{w'^k} - \sum_{w'^k} d^F_{w'^k}$, where $w_s^j$ is a
prefix of $w^k$ and $w_s^i$ is a prefix of all $w'^k$, and
$CFail_s = \big(\sum_{j=i}^{k} \sum_{w^j \neq w_s^j} d_{w^j}\big) - \sum_{w'^k} d^A_{w'^k} - \sum_{w'^k} d^F_{w'^k}$,
where $w'^k$ must be a suffix of some $w^i \neq w_s^i$.

2. Let $s \in S_{w^k}$ be a station ready to send in the second round following fault $k$.
Then:

$CAcc_s = d_{w^k} - d^F_{w^k}$, and

$CFail_s = \sum_{w'^k} d_{w'^k} - \sum_{w'^k} d^F_{w'^k}$ with $w'^k \neq w^k$.

Proof

For 1, the proof follows what has been explained informally above. For 2, one assumes
that counters $d_{w^k}$ are kept as they are at the end of the first round, and that the
$d^F_{w^k}$ are reset to 0 and incremented during the second round each time some station
from set $S_{w^k}$ is prevented from sending. The result follows, since $d_{w^k}$ contains exactly
the number of stations from set $S_{w^k}$ that have sent in the first round following fault $k$.
□

5.2 Complexity issues 

It follows that, for one scenario of the $k$ faults case, the total number of counters
needed is $\sum_{i=1}^{k} 2(i + 1) + k + 2(k + 1)$. There are $\prod_{i=1}^{k} i$ possible scenarios to check.
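
To give a feel for how quickly these numbers grow, the two counts can be evaluated for small $k$ with the short Python sketch below. It rests on assumptions that should be checked against the text: the sum is taken over faults $i = 1, \dots, k$ with $2(i + 1)$ counters created at fault $i$, and the number of scenarios is taken to be $k!$ (fault $i$ may hit a sender from any of the $i$ sets existing at that point). The function names are hypothetical.

```python
import math


def counters_per_scenario(k: int) -> int:
    """2(i+1) counters C and d per fault i, k slot counters C_p(i),
    and 2(k+1) auxiliary counters d^A and d^F."""
    return sum(2 * (i + 1) for i in range(1, k + 1)) + k + 2 * (k + 1)


def scenario_count(k: int) -> int:
    """Number of scenarios, assuming i possible sender sets at fault i."""
    return math.factorial(k)


# Counter and scenario counts for one to four faults.
table = [(k, counters_per_scenario(k), scenario_count(k)) for k in range(1, 5)]
```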

Using all these counters, an extended automaton similar to the one given in
Figure 5 can be designed and, in theory, automatically verified. Properties analogous
to P6 and P7 have to be checked to prove that after the second round following
fault $k$, there is only 1 clique. However, in practice, presently available tools
cannot handle such a number of counters, already for two faults¹. Nevertheless,
a scenario for two faults has been successfully verified in (Bardin et al. 2004)
using their tool FAST after performing further ad hoc abstractions to reduce the
number of counters.

6 Conclusion 

We have proposed an approach for automatically verifying a complex algorithm
which is industrially relevant. The complexity of this algorithm is due to its very
subtle dynamics, which are hard to model. We have shown that these dynamics can
be captured by means of unbounded (parametric) counter automata. Even if the
verification problem for these infinite-state models is undecidable in general, there
exist many symbolic reachability analysis techniques and tools which can handle
such models.

Our approach makes it possible to build a model (counter automaton) for the
algorithm with an arbitrary number $n$ of stations, but for a given number $k$ of faults.
We have applied our approach by verifying, in a fully automatic way, the model in the
case of one fault, using the ALV tool and the LASH tool.

Related Work: (Bauer and Paulitsch 2000) provides a manual proof of the
algorithm in the 1 fault case. Theorem 01 generalizes this result to the case of any
number of faults. As far as we know, all the existing works on automated proofs
or verifications of the membership algorithm of TTP concern the case of one fault,
and only symmetric fault occurrences are assumed. In our work, we consider the
more general framework where several faults can occur, and moreover, these faults
can be asymmetric. In (Pfeifer 2000), a mechanised proof using PVS is provided.
(Katz et al. 1997; Baukus et al. 2000; Creese and Roscoe 1999) adopt an approach
based on combining abstraction and finite-state model-checking. (Katz et al. 1997)
has checked the algorithm for 6 stations. (Baukus et al. 2000; Creese and Roscoe 1999)
consider the parametric verification of $n$ stations; (Creese and Roscoe 1999) provides
an abstraction proved manually, whereas (Baukus et al. 2000) uses an automatic
abstraction generation technique, both abstractions leading to a finite-state
abstraction of the parameterized network. The abstractions used in those works
do not seem to extend to the case of asymmetric faults or to the $k$ faults case.
To tackle this more general framework, we provide an abstraction which yields a
counter automaton and reduce the verification of the algorithm to the symbolic
reachability analysis of the obtained infinite-state abstract model. Moreover, our
abstraction is exact in the sense that it faithfully models the emission of frames by
stations.

¹ In the case of two faults, we got memory problems both with ALV and LASH.

Future Work: Our future work is to automate, for instance using a theorem
prover, the abstraction proof which makes it possible to build the counter automaton
modeling the algorithm. More generally, an important issue is to design automatic
abstraction techniques that produce infinite-state models given by extended automata.
Finally, a challenging problem is to design an algorithmic technique that automatically
verifies the algorithm while taking into account both of its parameters simultaneously,
i.e., for any number of stations and for any number of faults.

Acknowledgment: We thank anonymous referees for their helpful comments.